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1. INTRODUCTION 


1.1 Motivation 

Instruction retry is an effective technique to allow rapid recovery from transient faults 
in a processing system. Multiple instruction rollback recovery may be appropriate when 
error detection latencies or when error reporting latencies are greater than a single in- 
struction cycle. Single and multiple instruction rollback recovery has been implemented 
in hardware by researchers and main-frame computer designers. In general, complex 
im plementations strive to minimize impacts to system performance while less complex 
implementations permit some performance impact. Compiler-assisted multiple instruc- 
tion rollback has also been developed which replaces dedicated data redundancy hardware 
with compiler transformations that remove rollback data hazards. This thesis introduces 
a multiple instruction rollback recovery scheme, which uses the compiler to remove some 
rollback data hazards and uses dedicated data redundancy hardware to remove all re- 
maining hazards. The scheme results in less hardware than required for the least complex 
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hardware implementations and performance approaching that of the most complex hard- 
ware implementations. 

1.2 Thesis Contributions 

The contributions of this thesis are grouped into four topics. The first extends previ- 
ous compiler-based multiple instruction rollback to a broad class of code execution fail- 
ures. Data hazards that result from instruction rollback are classified and shown to be of 
two types: 1 ) on-path hazards, and 2) branch hazards. Previous compiler-driven data- 
flow manipulations resolve on-path hazards only. These transformations are extended to 
resolve both on-path and branch hazards. Evaluations of the extended compiler-based 
scheme indicate slightly increased performance impacts over transformations that resolve 
only on-path hazards. Using the hazard classification, a new compiler-assisted scheme is 
proposed which utilizes a hardware implemented read buffer to remove on-path hazards 
and compiler transformations to remove branch hazards. Performance evaluations of the 
new compiler- assisted multiple instruction rollback scheme indicate a lower performance 
penalty than with either a compiler-based approach or a comparable hardware- based 
approach. 

The second develops enhancements to previous compiler transformations used for 
rollback hazard removal. A one-pass node splitting algorithm is developed which uses 
the concept of conflicting parents and graph coloring to eliminate constraints that forced 
previous node splitting algorithms to operate iteratively. Experimental evaluations show 
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that the one-pass node splitting algorithm reduces code growth and achieves a compile- 
time speedup of 30 over iterative algorithms. To complement the one-pass node splitting 
algorithm, a one-pass static loop protection algorithm was also developed along with a 
dynamic loop protection algorithm incorporated into the one-pass node splitting algo- 
rithm. The use of profile data to aid in loop protection decisions was evaluated and found 
to be effective for some cases in improving application run-time performance. 

The third studies the read buffer size requirement. A flexible evaluation methodology 
is developed and used to study ten read buffer configurations. The evaluation methodol- 
ogy used updates a read buffer model at dynamically occurring application instructions 
boundaries. It was found that a 55% read buffer size reduction is achievable with an 
average reduction of 39.5% over the ten applications evaluated, given the most efficient 
read buffer configuration, but that additional control logic to handle read buffer overflows 
may limit the overall hardware savings. 

The final topic of the thesis studies the application of compiler-assisted multiple 
instruction rollback to aid in speculative execution repair. It is shown that the handling 
of speculated excepting instructions and the handling of mispredicted branches in a 
speculative execution architecture are similar to the handling of on-path and branch 
hazards, respectively, in multiple instruction rollback. A speculative scheduling method 
referred to as implicit index scheduling is proposed which utilizes a modified read buffer 
to remove on-path hazards and compiler transformations to remove branch hazards. The 
viability of the read buffer to aid in branch repair is also investigated and shown to 
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be contingent on the expected read buffer flush penalty. Estimates of flush penalties are 
obtained using the same evaluation methodology developed for the read buffer size study. 
Evaluation results indicate that read buffer flush costs under 15% are achievable. 

1.3 Thesis Organization 

Chapter 2 presents a background of error detection strategies and rollback recovery 
schemes. Chapter 3 describes the error model and classifies data hazards that result 
from instruction rollback. Compiler transformations that remove rollback hazards are 
presented, along with resulting performance evaluations of a compiler-only multiple in- 
struction rollback scheme. Chapter 3 also presents a new compiler-assisted multiple 
instruction rollback recovery scheme along with experimental evaluations of the new 
scheme. Chapter 4 describes enhancements to compiler transformations used for rollback 
hazard removal. These enhancements focus on reducing compile times and improving 
run-time performance. Chapter 5 assesses the minimum size requirements of the read 
buffer proposed in Chapter 3 and gives performance evaluations for ten read buffer con- 
figurations. Chapter 6 investigates the viability of applying the new compiler- assisted 
multiple instruction rollback scheme to aid in speculative execution repair. Chapter 7 
contains summary remarks, limitations, and future research directions. 


2. BACKGROUND 


2.1 Error Detection 

Various error detection strategies have been studied, resulting in a range of efficien- 
cies in both implementation complexities and error detection latencies. Error detection 
schemes that utilize error correcting codes [1] or low level functional redundancy [2] typ- 
ically detect errors within the same cycle, however, can result in increased cycle times. 
Some time redundant [3], algorithm-based [4], high level functional redundancy [5], and 
control-flow checking [6, 7] error detection schemes achieve error detection latencies of a 
few cycles without impacting system cycle times. In designs for which error detection 
latencies are greater than a single instruction cycle, multiple instruction rollback recovery 
may be appropriate. 
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2.2 Rollback Recovery 

2.2.1 System-level checkpointing and recovery 

System-level checkpointing is a well- understood method for implementing rollback 
recovery when system errors occur [8-10]. In the case of a detected fault, the system 
is rolled back to a previous checkpoint containing a consistent state of the system [11]. 
System-level checkpointing is implemented in software (typically included in the oper- 
ating system) with the checkpointed system state being stored on stable media such as 
the system disk. To minimize overall system performance impacts given the significant 
overhead associated with taking a checkpoint, checkpoint intervals must be great (from 
from minutes to hours). This strategy permits long error detection latencies, however, 
has the disadvantage of long recovery times and significant lost work during repair. 

2.2.2 Multiple instruction rollback 

When transient processor errors occur, multiple instruction rollback (also referred to 
as multiple instruction retry or simply instruction retry) can be an effective alternative 
to system-level checkpointing and rollback recovery [12, 13]. Multiple instruction retry 
within a sliding window of a few instructions [12], or re-execution of a few cycles [5], 
can be implemented in parallel with concurrent, algorithm-based, or control-flow error 
detection methods for rapid recovery from transient processor errors. Rapid error de- 
tection ensures minimal system state changes between detection and rollback and allows 
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hardware to efficiently save and restore the required system state. Multiple instruction 
rollback recovery is feasible only when error detection latencies are sufficiently small. 

The issues associated with instruction retry are similar to the issues encountered with 
exception handling in an out-of-order instruction execution architecture. If an instruction 
is to write to a register and N is the maximum error detection latency (or exception 
latency), two copies of the data must be maintained for N cycles. Hardware schemes 
such as reorder buffers, history buffers, future files [14], and micro- rollback [12] differ 
in where the updated and old values reside, circuit complexity, CPU cycle times, and 
rollback efficiency. 

2.3 Hardware Implemented Instruction Retry 

Multiple instruction retry and system level checkpointing are similar concepts differ- 
ing in implementation (i.e., hardware versus software) and scope (i.e., the error detection 
latency and amount of system state involved). Similar to system level checkpointing, 
hardware implemented instruction retry schemes belong to one of two groups: 1) full 
checkpointing and 2) incremental checkpointing. Full checkpointing maintains “snap- 
shots” of the required system state space at regular, or predetermined, intervals. Upon 
error detection, the system can be rolled back to the appropriate checkpointed system 
state. Incremental checkpointing maintains changes to the system state in a “sliding win- 
dow”. Upon error detection the system state is restored by undoing, or “backing-out” 
the system state changes up to the instruction in which the error occurred. 
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Several examples from each instruction retry group will be discussed. The exam- 
ples are drawn from research, commercial processors, and patent applications. These 
examples, along with a discussion of compiler- assisted instruction retry, will serve as 
a background and comparison for the proposal of a new multiple instruction rollback 
approach presented in Chapter 3. 

2.3.1 The IBM 4341 

The IBM 4341 supports the capability for single instruction retry by making use of 
a level sensitive scan design (LSSD), which was originally proposed to provide increased 
observability and controllability in LSI circuits and also to make sequential logic oper- 
ations independent of circuit delays and wire delays [15]. Figure 2.1 illustrates both of 
these features. 1 

During normal operation C LK~cl and C LKJ> operate as interleaved, nonoverlapping 
system clocks. CLK.a latches state changes and provides stable inputs to the second 
stage of the latch pair. CLKJb modifies the system state using these stable values. LSSD 
ensures that the steady-state output is independent of the sequence of input signals or 
signal rise/fall times. Scanning is accomplished by substituting CLK^s for C LK.a in 
the normal operation. In this way, a complete system state image can be loaded serially 
through the SCANJn line. Likewise, the current system state image can be obtained 
serially through the SC AN jout line. 


1 Derived from pp. 438, * Logic Design Principles”, E. J. McCluskey [16]. 
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Figure 2.1: LSSD double latch design. 
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The 4341 incorporates three error detection strategies; 1) duplication and compare, 
2) odd-parity, and 3) special error conditions such as invalid combination of control lines. 
When one of these mechanisms detects an error, the system clocks are “frozen. The 
current system state is then scanned out by a separate service processor which makes 
the proper adjustments to the system state. The updated state is then scanned back 
into the processor where the faulty instruction is retried. A more complete description 
of fault handling on the IBM 4341 can be found in [17]. The 4341 is best classified as a 
full checkpointed rollback recovery scheme, with a checkpoint interval of one instruction 
(via LSSD state duplication) and a rollback distance of one. 

2.3.2 The IBM 3081 

The instruction unit of the IBM 3081 establishes a checkpoint every 10 to 20 instruc- 
tions. This checkpoint is then used as a re-entry point for rollback repair in the event 
that an error is detected. The local cache array includes hardware which saves the old 
value of an updated cache line into a push-down array. Flushing the push-down array in 
reverse order restores the cache array to its state at the time of the last checkpoint. This 
method is similar to the history buffer which is described in Section 2.3.8 [14]. 

The 3081 uses LSSD to read and write system state information. Scanning out the 
entire system state, including the general purpose register file (GPRF) and local cache 
arrays, can be very time-consuming. The hardware support previously mentioned obvi- 
ates full scanning. The IBM 3081 implements full checkpointed rollback recovery with a 
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checkpoint interval of 10 to 20 instructions and corresponding rollback distance. A more 
complete description of fault handling on the IBM 3081 can be found in [13, 18]. 

2.3.3 The VAX 8600 

The VAX 8600 uses parity checking on internal buses including the arithmetic logic 
unit (ALU) and shifter. The instruction fetch and decode unit (I box), floating point unit 
(F box), and execution unit (E box) each have a copy of the GPRF. Writes to one GPRF 
cause simultaneous writes to the other two GPRF’s in order to maintain consistency. If 
a parity error is detected in one of the GPRF’s, the other GPRF’s can be used to correct 
the invalid GPRF. 

For performance reasons, parity checking and correction in the local cache (M box) 
occur after the data has been sent to the requester. If an error is detected by the M 
box, it is corrected with Error-Correcting Code (ECC) logic and an error signal is sent to 
the E box so that the instruction can be retried. The VAX 8600 is a pipelined complex 
instruction- set architecture which does not guarantee that all updates to the system state 
are held un til the write state of the pipeline. If the CP U has not performed an operation 
that makes retry impossible, the instruction is retried. The ability to retry is determined 
by the abort bit which is set when the current instruction: 1) is an I/O read, 2) is a 
memory write, or 3) results in a modification of the system state by the E box. The VAX 
8600 can be classified as a full checkpointed rollback recovery scheme, with a checkpoint 
interval of one instruction and a rollback distance of one. A more complete description 
of fault handling on the VAX 8600 can be found in [19]. 
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2.3.4 The VAX 9000 

The VAX 9000 system is organized such that all storage elements, i.e., latches and 
flip flops, are connected to form a “visibility chain.” The system state is therefore visible 
and can be read and written by the service processor through a serial diagnostic bus. The 
visibility chain operates much like a scan ring design [15, 16]. The VAX 9000 is a pipelined 
complex instruction-set architecture. The user visible system state (i.e., memory contents 
and register values) is well-defined at macroinstructions boundaries; however, during 
actual operation, several instructions will be executing at once, with each instruction at 
a different stage in the pipeline. This make identification of macroinstructions boundaries 
difficult. Also, errors cannot be expected to occur at well-defined instruction boundaries. 

The VAX 9000 execution unit (E-box) accepts operands, computes the result, and 
delivers the result for storage. An error interfering with one or more of these operations 
causes a trap when the E-box requests data from the faulty subsystem. If the program 
visible state of the machine has not been modified by the current instruction, the in- 
struction is backed up to the beginning and restarted. This is usually possible since the 
system state typically changes in the final stages of the pipeline. If memory or register 
values were modified early in the pipeline, status flags are updated so that instruction 

retry can be disabled. 

The VAX 9000 can best be classified as a full checkpointed rollback recovery scheme, 
with a checkpoint interval of one instruction and a rollback distance of one. A more 
complete description of fault handling on the VAX 9000 can be found in [20]. 
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MAIN ROUTINE 


Figure 2.2: Checkpoint retry mechanism: U.S. patent number 4,912,707. 

2.3.5 IBM Patent number 4,912,707 

The checkpoint retry me chani sm described in U.S. patent number 4,912,707 deals 
primarily with checkpoint placement in a hardware implemented multiple instruction 
rollback scheme. Many instruction retry schemes have a rollback distance of one instruc- 
tion making checkpoint placement trivial. Nontrivial checkpoint placements are handled 
by software in system level checkpointing but are more difficult in hardware implemen- 
tations without precise placement rules. 

As shown in Figure 2.2, this mechanism automatically takes checkpoints at one of 
three points in a dynamic instruction stream: 1) immediately after a write instruction, 
2) immediately after a return from interrupt, or 3) immediately after the first instruction 
in an interrupt handler routine. This approach avoids the need to take a checkpoint 
after every instruction and ensures that a rollback will never cross a write or interrupt 


boundary. 
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The checkpoint is taken by storing a copy of the GPRF and program status word 
(PSW) into a shadow file, as shown in Figure 2.2. For rollback, the GPRF is restored from 
the shadow file. This checkpoint retry mechanism implements full checkpointed rollback 
recovery with a variable checkpoint interval and corresponding rollback distance. A more 
complete description of the mechanism can be found in U.S. patent number 4,912,707 
[ 21 ]. 

2.3.6 IBM Patent number 4,044,337 

The instruction retry mechanism described in U.S. patent 4,044,337 resolves the prob- 
lem of register file and cache memory corruption due to a defective instruction cycle. 
The mechanism maintains duplicate copies of such data to be used for restoration in the 
event of corruption. 

For each register in the GPRF, two duplicate locations are maintained, as shown in 
Figure 2.3. Each time register r x is written in the GPRF by instruction 7 y , one of these 
two duplicate registers, t' x or r x , is also written with the same value. The duplicate 
locations used, alternate with instruction count, that is, 7 y writes to r' c , 7 y + 1 writes to 
t’x, 7»+2 writes to r'j., etc. If 7 y writes r x more than once, each subsequent write of 7 y 
overwrites the previous value in the duplicate location leaving only the last written value 
of r x in the location upon completion of 7 y . This scheme ensures that, given an error in 
7 y+ i, the GPRF contents which existed prior to the execution of 7 y+ i can be recovered 
from the duplicate register locations. 

2 U.S. patent 4,044,337 refers to register file storage as local store and cache storage as buffer store. 
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Figure 2.3: GPRF duplicate store mechanism: U.S. patent number 4,044,337. 


A mech ani sm similar to the one used for GPRF duplication is used for the cache 
duplication. Since triplication of the cache results in a significant circuit overhead, the 
total cache duplicate store size is significantly smaller than the cache size. As shown 
in Figure 2.4, a store controller accepts the cache address and maps it into the smaller 
duplicate stores. The duplicate stores operate the same as the register file duplicate 
stores,, except that in addition to the cache data, cache addresses are also saved. If a 
cache store occurs when the duplicate stores are full, a replacement algorithm in the store 
controller submits a request for the cache to write out an appropriate cache line to main 
memory. This makes a corresponding duplicate store pair available for the current cache 

write. 

The checkpoint retry mechanism, described in U.S. patent number 4,044,337, imple- 
ments incremental checkpointing and rollback recovery, where changes to the system state 
are maintained and backed out during recovery. The mechanism supports a maximum 
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Figure 2.4: Cache duplicate store mechanism: U.S. patent number 4,044,337. 
rollback distance of one instruction. A more complete description of this checkpoint retry 
mechanism can be found in [22]. 

2.3.7 Delayed write buffer 

The delayed write buffer (DWB), also referred to as micro-rollback [12], establishes 
data redundancy to aid in rollback recovery by delaying writes to the appropriate storage 
location. Given a GPRF DWB of depth N, the last N register writes are contained in 
the DWB while the unmodified copies are maintained in the GPRF. Figure 2.5 gives an 
example of a high-level DWB design, where the DWB is organized as a first-in-first-out 
(FIFO) queue. Included in the illustration is a sample code segment and the resulting 
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Figure 2.5: Delayed write buffer. 


DWB contents. For simplicity, it will be assumed that the example processor is a simple 
pipelined load/store machine, executing one instruction each machine cycle. 

A value written to register r x in the GPRF is denoted as str jval{r x ) and is shown 
in the DWB along with the register address x. Since the most recent value of a register 
may be contained in the DWB, bypass logic is included to inspect the DWB during all 
register accesses to determine if a more recent value is available. If more than one copy 
of an accessed register is present in the DWB, priority logic contained in the bypass unit 


forwards the most recent value to the appropriate source bus. This latter case occurs 
during instruction I 4 where ri is accessed and two copies are present in the DWB. The 
delayed write buffer is very similar in operation to the reorder buffer proposed to aid in 
exception repair for out-of-order execution architectures [14]. 

As long as an error is detected within N cycles (for this example one instruction 
is completed each cycle), the contents of the DWB can be invalidated, restoring the 
GPRF to a precise state [11] prior to the faulty instruction. The DWB can be applied 
to any system storage unit, however, it is best suited to units that are accessed through 
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Figure 2.6: History buffer. 


an address. For storage units that can be accessed directly, e.g., the program counter, 
instruction counter, and program status word, a state history can be maintained in a 
simple first-in-first-out (FIFO) queue. The delayed write buffer can be classified as an 
incremental checkpointing and recovery scheme. 


2.3.8 History buffer 

The history buffer (HB) is derived from the reordering version proposed as an aid 
in exception repair for out-of-order execution architectures [14]. Figure 2.6 illustrates 
a high-level HB design to aid in rollback recovery, along with an example instruction 
sequence and the corresponding HB contents. Each time register r x is written in the 
GPRF, the old contents of r r , denoted as ld.val(r x ), are read out and placed in the HB. 
The HB is organized as an FIFO queue. 

In an HB organization, the most recent values of all registers are maintained in 
the GPRF; therefore, bypass logic is not required as. with the DWB. In a pipelined 
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Instruction 

Sequence 


hr r i = r 2 + r 3 


h : r 2 = r 4 
h : r l = r 2 + r 6 


Figure 2.7: History file. 

architecture, reads from both ports of the GPRF in each cycle are common. An efficient 
HB design requires an extra GPRF read port, complicating the GPRF design. Rollback 
to a precise state is accomplished by flushing the HB in reverse order, up to and including 
the faulty instruction. The flush operation requires many cycles in contrast to the single 
cycle invalidate of the DWB. The HB is classified as an incremental checkpointing and 


recovery scheme. 


2.3.9 History file 

The history file (HF) is derived from the future file proposed to aid exception repair 
in out-of-order execution architectures [14]. Figure 2.7 shows a high-level HF design 
consisting of the system GPRF and a duplicate register file. A write buffer depth of N, 
delays writes to the HF and ensures that the HF state is the precise state of the system 
that existed N cycles in the past. Since the most recent register values are contained in 
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the GPRF and therefore no bypass logic is required. Unlike the HB, no additional read 
port is required for the GPRF, instead a fuU duplication of the GPRF is required along 

with the write buffer. 

Rollback of exactly N instructions is very efficient, requiring a global load from the 
HF to the GPRF and an invalidate of the write buffer. These two operations can be 
performed in a single cycle. Variable rollback requires that the write buffer be flushed to 
the HF, up to but not including the faulty instruction prior to the global load from the 
HF to the GPRF. Like the DWB and HB, the HF is an incremental checkpointing and 
recovery scheme. 

2.3.10 The IBM ES/9000 

Prior to the invention of cache memory, main memory load latency was a significant 
performance limiter in high-end processing systems. In an effort to maximize CPU pro- 
ductivity during main memory accesses, IBM developed out-of-order execution [23]. The 
out-of-order execution feature was dropped by IBM in 1968 after the introduction of the 
cache. 

Due to the availability of higher density technology and the need for higher perfor- 
mance, IBM has once again incorporated out-of-order execution into its high-end ES/9000 
system. The virtual register management and branch misprediction repair schemes in- 
clude data redundancy useful for multiple instruction retry. Although the design point 
for the ES/9000 is to detect errors within the current machine cycle, through ECC, 
parity, or checking, the reporting and recording of the error may take several cycles. 
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Figure 2.8: Virtual register management for the IBM ES/9000. 


The ES/9000 has considerable fault-tolerance capability; however, only the instruction 
retry facility will be discussed here. For a more detailed description of the fault-tolerant 
characteristics of the ES/9000, see [24]. For a complete description of virtual register 
management, see [25,26]. 

The virtual register management system (RMS) maps 16 architectural registers into 
32 physical registers. Architectural registers identified in a decoded instruction locate 
pointers in the decode-time register assignment list (DRAL) shown in Figure 2.8. These 
pointers locate the appropriate physical register from the GPRF. The array control list 
(ACL) has entries which correspond to the entries in the GPRF. The ACL contains 
the required physical register status such as the load state of the physical register (i.e., 
available, pending but not loaded, pending and loaded, assigned), instruction number, 
branch dependence, architectural register assignment, and previous assignment. The 
branch register assignment lists, BRAL_A and BRAL_B, contain exact images of the 
DRAL at the time a branch path is predicted. If the branch is mispredicted, the DRAL 
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ran be restored to its value prior to the branch. The two BRAL’s allow a total of two 
pending branch predictions. A third branch prediction causes a stall until one of the two 
pending branches is resolved. 

The RMS supports multiple instruction retry by holding the availability of physical 
registers until the appropriate instructions have been determined to be fault free. In 
this way, a faulty instruction can never overwrite the contents of an architectural regis- 
ter. Figure 2.9 illustrates this instruction retry feature. 3 Out-of-order execution forces 



Figure 2.9: Instruction retry and recovery for the IBM ES/9000. 


instructions to complete in order but allows instructions to finish in any order. Only 
upon completion is the physical register marked available for reassignment. In Figure 
3 Figure 2.9 source: Proc. 22th Int. Symp. Fault-Tolerant Comput. [24]. 
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2.9, instructions 3 and 4 (/ 3 , h) finish prior to J 2 . I\ completes at time Tl, and a fault 
in h is recorded at time T2. Registers modified by I 2 and h have not been released for 
reassignment; therefore, their contents axe available for the retry sequence shown. The 
instruction rollback and recovery scheme of the ES/9000 can best be described as an 
incremental checkpointing scheme with a variable rollback distance. 

2.4 Compiler- Assisted Rollback Recovery 

2.4.1 Compiler-assisted checkpoint placement 

Recently, compiler-based approaches to checkpointing and recovery have been inves- 
tigated. The studies have been conducted in two areas: 1) system/application level 
checkpointing and recovery, and 2) multiple instruction rollback recovery. As an alter- 
native to system and application level checkpointing, a compiler-assisted checkpointing 
and recovery scheme has been proposed [27]. The scheme uses compile-time information 
to create checkpoints adaptively. Efficient use of compile-time information allows for the 
determination of optimal checkpoint placements, the minimization of checkpoint sizes 
by exploiting large variations in memory usage, and the generation of sparse checkpoint 
code. A training technique was also developed resulting in checkpoints with lower cost 
and higher coverage. The compiler-assisted checkpointing scheme was shown to result in 
reduced checkpoint size while maintaining transparency at the programmer, operating 
system and hardware levels. 
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Figure 2.10: Dependencies and their impact during rollback. 

2.4.2 Compiler-assisted multiple instruction rollback 

In contrast to system level checkpointing, compiler-assisted multiple instruction roll- 
back supports the rollback of a few instructions by using compiler-driven data-flow ma- 
nipulations to remove hazards that result from rollback. 

Figure 2.10 illustrates three data dependencies relative to variable x and their effect 
on rollback hazards. 4 For the flow dependency , the instruction 7, writes variable x and 
then the subsequent instruction Ij reads x. If an error is detected after Ij and rollback 
is below Ij, x has not been modified and there is no hazard. If the rollback is between 7, 
and Ij, again x is correct and no rollback hazard exists. If the rollback is to a position 
above 7,-, x haw been corrupted since it was modified by 7,-. In this latter case, there is 
still no data hazard since x will be rewritten by 7,- prior to its use in Ij, i.e., the variable 
x is dead. Given an output dependency , if an error is detected after Ij and rollback is 


4 For a complete presentation of data-flow properties and manipulation methods, see [28]. 
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below Ij , x is correct and no rollback hazard is present. K the rollback is either between 
Ii and Ij or above x is dead an no hazard exists. 

A data hazard is present, however, given an antidependency , an error detection below 
Ij, and a rollback to a position above In this case Ij corrupts x and after rollback, I, 
uses the corrupted x value. Hardware rollback schemes maintain a redundant copy of x so 
that it can be restored to the correct value in the event of a rollback. By using compiler 
transformations to remove all antidependencies of length < N, where N represents the 
maximum rollback distance, the compiler-assisted rollback scheme removes data hazards 
and the requirement for redundancy hardware. 

Antidependencies are removed at three levels: 1) pseudo code, 2) machine code, and 
3) post-pass. Pseudo code is the code level prior to variables being assigned to physical 
registers. The primary compiler transformation to remove antidependencies is variable 
renaming. For the antidependency case of Figure 2.10, variable x of instruction Ij would 
be renamed x', requiring all subsequent uses of x to be renamed x'. Variable renaming 
becomes difficult when the renaming of x in Ij results in the need to rename x in This 
occurs when backedges exist (i.e., loops) and also through the equivalence relationships 
of variables [29]. These difficulties are handled with node splitting , loop expansion , and 
loop protection transformations [28,29]. 

Once all antidependencies have been removed at the pseudo level, register allocation 
may result in the re-emergence of some antidependencies. An example of this would be 
if renamed variable x' of instruction Ij in Figure 2.10 and variable x of Ii were assigned 
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to the same physical register. To prevent this, arcs are added to the dependence graph 
used in the register allocation algorithm. 

Due to register spills and register save/restore conventions at procedure boundaries, 
some antidependencies remain in the compiler emitted code. These hazards are resolved 
by a post-pass transformation which inserts no-operation (nop) instructions to increase 
the antidependency distance to > N. The post-pass transformation carries a significant 
performance penalty since up to N nop’s could potentially be inserted in a frequently 
executed portion of the code. For this reason, all possible antidependencies are removed 
prior to the post-pass level. The overall performance of the compiler-assisted multiple 
instruction rollback scheme is comparable to that of hardware schemes, with the primary 
advantage being the reduced hardware requirement and the ability to select the rollback 

distance at compile time. 

2.5 Summary 

Several full checkpointing and incremental checkpointing instruction retry schemes 
have been presented as background for the development of a new multiple instruction 
rollback approach. These schemes demonstrate the variety, design trade-offs and contin- 
ued viability of multiple instruction rollback as a key fault tolerance feature. 

This thesis focuses on multiple instruction rollback techniques and specifically compiler- 
based techniques similar to those presented in Section 2.4.2. Previous compiler-based 
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schemes produced average performance impacts similar to comparable hardware imple- 
mented schemes such as the delayed write buffer [12]. The performance impact of the 
compiler-based scheme, however, varied greatly between applications. This thesis extends 
compiler-based multiple instruction rollback recovery to a broad class of code execution 
failures and reduces the average and peak performance impacts observed in previous 
schemes. The new compiler- assisted scheme combines compiler-driven rollback hazard 
removal with hardware implemented hazard removal. 



3. COMPILER-ASSISTED MULTIPLE INSTRUCTION ROLLBACK 


3.1 Introduction 

This chapter extends the compiler-based [29] instruction retry scheme discussed in 
Chapter 2 to include a broad class of code execution failures. 1 Given a more general 
error model, data hazards which occur as a result of multiple instruction rollback are 
formally classified. This classification proves useful in the development of two new mul- 
tiple instruction retry schemes. The first scheme extends the compiler-based scheme 
while the second scheme combines compiler-driven hazard removal with hardware data 
redundancy techniques. The new compiler-assisted approach exploits the unique charac- 
teristics of different hazard types. Ten benchmarks were used to measure the performance 
penalty of hazard resolution. Experimental results indicate that the enhanced compiler- 
based approach ran achieve overall performance consistent with existing hardware and 
compiler-based approaches, and that the new compiler-assisted resolution strategy can 


Portions of Chapter 3 were derived from [30]. 
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achieve superior performance to either a hardware-only or compiler-based instruction 
retry scheme. 

3.2 Error Model and Hazard Classification 

3.2.1 Rollback data hazard model 

The following are assumptions used in the error model: 

1. The maximum error detection latency is N instructions. 

2. Memory and I/O have delayed write buffers and can rollback N cycles. 

3. The states of the program counter and program status word (PSW) are preserved 
by an external recording device or by shadow registers as described in the micro 

rollback scheme [12]. 

4. The CPU state can be restored by loading the correct contents of the register file, 
program counter, and PSW. 

Given the above assumptions, any error which does not manifest itself as an ille- 
gal path in the control-flow graph (CFG) of the program is allowed provided that the 
following conditions are satisfied: 

1. Register file contents do not spontaneously change. 

2. Data cam not be written to an incorrect register location. 
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The following is a list of targeted errors: 

1. CPU errors such as those caused by an ALU. 

2. Incorrect values being read from I/O, memory, the register file, or external func- 
tional units such as the floating point unit. 

3. Correct /incorrect values being read from incorrect locations within the I/O, mem- 
ory, or register file. 

4. Incorrect branch decisions resulting from error types 1, 2, or 3. 

3.2.2 Hazard classification 

The code can be represented as a CFG G(V, E), where V is the set of nodes denoting 
instructions and E is the set of edges denoting control-flow. If there is a direct control- 
flow from instruction t, denoted A, to Ij, where /,• € V and Ij € V, then there is an edge 
€ E. Let dm, „(/„/;) denote the smallest number of instructions along any path 
from Ii to Ij. 

The hazard set H regt of the error model is defined as the set of pseudo registers 
(or m achin e registers) whose values are inconsistent during different executions of an 
instruction sequence due to retry. A formal classification of hazard set H re g t follows. 

Property 1: x € H reg , iff there exists a sequence of instructions A, /j, . . . ,/jv which 
form a legal walk 2 in G such that x is live at A, and x is defined during the walk. 


2 A walk is a sequence of edge traversals in a graph where the edges visited can be repeated [31]. 
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Proof: For the if case, an error occurring in I\ will be detected by In- During the 

retry of I x , x will be in an inconsistent state since it was defined during the walk. Since 

\ 

x is live at I u there is some path along which x is used prior to its redefinition, and since 
x is in an inconsistent state, x € H regt . For the only if case, we suppose the contrary. 
Assume that among all legal walks of length N in G, either x is not live at the beginning, 
or x is not defined during the walk. It then follows that x either has no use, or x is not 
changed. (The error model does not allow a write to a wrong location and the contents 
of register x can not spontaneously change.) Therefore there is no inconsistency problem 
for x, which implies x ^ H regt . 

Property 2: All hazards can be classified as one of two types: 1) those that appear 
as antidependencies of length < iV in G(V,E), referred to as on-path hazards, and 2) 
those that appear at branch boundaries, referred to as branch hazards. These two hazard 
types may overlap. 

Proof: Since x € H, there exists a legal walk W x = I\,h,- - -,I n m G, such that 
x is five at h, and after the execution of I x ,I 2 ,-.-Jn in sequence, x has a different 
value. The latter implies that there is at least one instruction defining x along W x 
(the error model does not allow a write to a wrong location and the content of register 
x can not spontaneously change). Let i be the largest index that /, defines x, where 
t 6 {1, 2, ... , N}. Property 1 implies that there exists a legal walk W 2 in G, beginning 
with h, such that the first instruction Ij along W 2 referring x is a use. Case 1: if 
W 2 CW i, instructions Ij and U constitute an antidependency of length < N, and there 
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Figure 3.1i On-path and branch hazards. 

is an on-path hazard on x. Case 2: if W 2 % W u there exists a branch instruction I k 
between I x and I M . Since U) < N, there is a hazard on x at a branch boundary. 


3.2.3 Definitions and terminology 

An on-path or branch data hazard occurs when U defines variable x, and after roll- 
back, Ij uses the corrupted x value prior to its being redefined. To simplify subsequent 
discussion, such on-path and branch hazards will be denoted h a (i,j,x) and h b (i,j,x) 
respectively. Figure 3.1 illustrates this hazard notation. A few definitions are now pre- 
sented to simplify subsequent discussions. 3 


3 A complete description of data-flow terminology can be found in “ Compilers : Principles, Techniques, 
and Tools”, Aho et al., [28]. More on equivalence can be found in [29]. 
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Definitions: 

1. If /,• defines variable x, then def(i) = x. 

2. If the k th operand of U uses variable x, then use k (i) = x. 

3. If there is some path beginning with h which encounters a use of x prior to a 
definition of x, then x € /ive_in(t). 

4. If there is some path from J, to Ij which does not encounter a redefinition of def(i), 
then def(i) reaches j. 

5. If def(i) reaches j and def(i) = use k (j) for some k, then def(i) reaches U3e k {j). 

6. If def(i) reaches j then t € reaching Jn(j). 

7. If r enami ng def(i) requires the renaming of use k (j) for any k, then equiv(i,j) = 1. 

3.3 Compiler Resolution of On-path and Branch Hazards 

Previously developed compiler transformations restrict hazard resolution to on-path 
hazards [29]. The transformations are performed in four phases. Phase 1 resolves pseudo 
register hazards, phase 2 resolves machine register hazards, phase 3 resolves interprocedu- 
ral register hazards, and phase 4 uses nop insertion to resolve the remaining hazards. The 
expanded error model of Section 3.2 permits branch hazards in addition to on-path haz- 
ards. What follows is a discussion of the viability of these same compiler transformations 
in application to branch hazard resolution. 
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Figure 3.2: Register renaming. 


3.3.1 Pseudo register renaming 

The basic compiler transformation to remove hazards is register renaming. Figure 
3.2 shows how hazard h a (i,j,x) can be removed by renaming def(i) from x to y. It can 
be seen from Figure 3.2 that register renaming is equally effective in resolving branch 
hazard h b (i, k,x) by renaming def(t) from z to y. If h 0 {i,j,x) and h b (i, k, x) coexist as in 
Figure 3.1, both hazards are resolved simultaneously, i.e., given h 0 (i,j,x) and h b (i, k,x), 
resolution of h 0 (i,j,x) through renaming resolves h b (i,k,x). In addition to renaming x 
to y in some uses of x in other instructions must also be renamed to y. The variables 
requiring re nami ng are determined by the equivalence property [29]. If equiv(i,j ) = 1 in 
the examples of Figure 3.2, then usei(j) would ultimately be renamed to y, negating the 
hazard resolution. Equivalence can negate both on-path hazard resolution and branch 


hazard resolution. 
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Figure 3.3: Node splitting. 

To break the equivalence relationship, node splitting and loop expansion transforma- 
tions are used. A loop protection transformation ensures that loop integrity is maintained 
during the node splitting and loop expansion transformations. 


3.3.2 Node splitting 

Given h 0 (i,j,x) or h b (i,j,x), node splitting forces equiv(i,j) = 0. Figure 3.3 shows 
an example of data dependence requiring node splitting and the result of a node splitting 
algorithm. Since def(i) reaches use x (l), renaming x in /, to y forces the the renaming of x 
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disable back edges; 
calculate hazards; 
while (changed) , do; 
changed ■ 0; 
for all x in Hreg * » *1® » 

for all V € G(V,E), do; 
if x & liveJn(V) 
continue; 

if multiple del init ions of x reach V, do; 
split (V) ; 
changed • 1; 

endfor 

endfor 

endwhile 

enable back edges; 


Figure 3.4: Node splitting algorithm. 

to y in I,. Since def(k) also reaches use 1 (/), def(k) must be renamed to y. Finally, def(k) 
reaches usei(j), requiring usei(j) to be renamed to y. Hazard h a (i,j, x) has changed to 
h 0 (i,j,y) but it has not been resolved. Register renaming cannot resolve h Q (i,j,x) or 
h b (i,j,x) given equiv(i,j) = 1. The simple node splitting algorithm shown in Figure 3.4 

forces equiv(i,j) = 0 given h 0 (i,j,x) or h b (i,j,x). 

When two definitions of a hazard variable reach a node, the node is split. The effect 
is an “unzipping” of instructions which stops when the hazard variable becomes dead or 
when a loop header is reached. In the former case, the equivalence relationship can no 
longer be affected by the instruction. In the latter case, a split of the loop header would 
compromise the integrity of the loop. It would be possible to treat the loop as a node 
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and duplicate the entire loop; however, this would result in significant code growth. The 
loop protection algorithm is responsible for ensuring that no loop header is split. 

After node splitting, a hazard node i 4 has a “personalized” path to each of the use 
nodes it reaches. More formally, given h 0 (i,j,x) or hi,(i,j,x), no usefc(m) that is reached 

by def(i) is reached by def(n), where def(n) = def(i). Node splitting does not break direct 

\ 

equivalence. Direct equivalence is equiv(i, j) — 1 such that def(i) reaches usek(j )• For 
h 0 (i,j,x) and x) this occurs when def(i) reaches usek(j) by traversing a loop back 

edge. 5 For this reason, the node splitting algorithm of Figure 3.4 is run with the back 
edges of G(V, E ) disabled. On-path hazards that remain after node splitting are resolved 

with loop expansion. 


3.3.3 Loop expansion 

Loop expansion involves unrolling a loop in an effort to remove on-path hazards 


which remain after node splitting. Figure 3.5 gives an example of a loop which requires 
expansion due to an on-path hazard. It can be seen that the rollback traverses the loop 
back edge. For this example, the loop is unrolled once by duplicating the loop body one 
time. The second copy of the loop body has all occurrences of x renamed to y. The 
length of the loop-carried antidependency present in the original loop has increased from 


''Hazard node i is defined as the node representing /, in G(V, E) given the existence of either h 0 (i, j, x) 
or /»&(»', k,x). 

5 In contrast to on-path hazards, direct equivalence for branch hazards can exist without loop back 
edges; however, was observed to be infrequent for the eleven application studied. The handling of these 
hazards is presented in Chapter 4. 
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loop 

back 

edge 



error 

detected 


Figure 3.5: On-path hazard traversing a loop back edge. 

< N to > N. Loop expansion to resolve on-path hazards results in significant code 
growth, reaching 350% for some applications [29]. 

Although on-path hazards which traverse loop back edges are common, we have ex- 
perimentally observed a low rate of occurrence of branch hazards traversing loop back 
edges. This is due to three factors. The first is that branch hazards are less common 
than on-path hazards. Figure 3.6 shows the percentage of all nodes that are on-path 
and branch hazard nodes given various rollback distances for the QUEEN and PUZZLE 
applications. Details on the evaluation methodology can be found in Section 3.5. The 
second factor is that resolving h 0 (i,j,x) through renaming resolves h b (i,k,x). Since on- 
path hazards are resolved prior to the resolution of branch hazards, many branch hazards 
which traversed loop back edges are no longer present. Finally, the most common code 
structure which results in a branch hazard traversing .a loop back edge also causes the 
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123456789 10 123456789 10 

QUEEN PUZZLE 

Figure 3.6: Percentage of total 7,’s that are on-path and branch hazards. 

loop to be protected by inserting a save/restore pair around the loop. The save/ restore 
pair br eaks the direct equivalence of the branch hazard and thereby resolves it without 
the need for loop expansion. Due to the significant code growth potential of loop expan- 
sion and the infrequency of branch hazards traversing loop back edges, all such hazards 
are left to be resolved in the nop insertion phase. 

3.3.4 Loop protection 

Figure 3.7 demonstrates how loop l is protected from hazard node i, where def(i) = 
x. The loop header will not be split since x £ UveJn(hdrjnode(l)), where hdr.node(l) 
represents the header node of loop l. The loop protection transformation operates on two 
inputs. The first is the pseudo register x which is defined in Ii given hazard h 0 (i,j,x). 
The second is the live-out analysis of the CFG. The loop protection transformation is 
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t = x ~~| save node 

x dead • 

— ^ hdr_node( l) | ^°°P * { 

' # change { 

• ally’s i 

* to f*s } 

x =■ t | restore node 

Figure 3.7: Loop protection from hazard variable x. 

not dependent on the type of hazard which identifies the pseudo register and will protect 
loops from header splitting that would occur as a result of branch hazards. 

3.3.5 Machine registers 

Once hazards have been eliminated through renaming, they may reappear as physical 
registers are assigned. It is also possible that new hazards will emerge. Figure 3.8 shows 
the elimination of on-path and branch hazards by adding arcs to the dependency graph 
used for register allocation. 

3.3.6 Interprocedural hazards 

Interprocedural register saving conventions can create immediate on-path hazards. 
For example, if register r k is read and saved prior to a procedure call, and then initialized 
in the called procedure, an antidependency is created. Previous work used a disjoint 
register block scheme to guarantee that any read prior to a procedure call and any 




Dependence Graph 




Figure 3.8: On-path and branch machine register hazards, 
definition during procedure initialization use registers from different blocks [29]. Branch 
hazards are not immediately created at procedural, boundaries. All remaining branch 
hazards are resolved in the nop insertion phase described in the next section or by the 
post-pass transformation described in Section 3.4.3 (p. 44). 


3.3.7 Nop insertion 

Spill code as a result of register allocation can create on-path and branch hazards. A 
similar problem exists with the stack pointer and frame pointer. Some branch hazards 
may also remain that were unresolved with the loop expansion transformation. On-path 
hazards are resolved by inserting nop instructions directly before the hazard instruction 
so that the rollback will be below the last use of the hazard register. This technique 
does not work for branch hazards since the distance between the definition and the use 
instructions is not relevant. Instead, nop insertion is used to increase the distance from 
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the hazard instruction to its nearest predecessor branch. In this case, a rollback will be 
below the branch. 

3.3.8 Summary 

It has been shown that compiler techniques previously developed to resolve on-path 
hazards are equally effective in resolving branch hazards. A compiler-based multiple 
instruction rollback recovery scheme utilizing theses transformations was developed and 
evaluated. The results of the evaluations are presented in Section 3.5. 

3.4 Hardware-Assisted Hazard Resolution 
3.4.1 The read buffer 

Figure 3.9 shows a hardware scheme to resolve on-path hazards. A read buffer is 
attached to the output ports of the register file. Each time a register is used it appears 
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on the read port and is saved in the read buffer. If a register r k is defined in J, and it is 
an on-path hazard, then r* must have been read within the last N cycles. In this case, 
the read buffer will contain the old value and it is permissible to write the new value into 
the register file. In the event of a rollback of N instructions, the contents of the read 
buffer axe flushed in reverse order and stored back to the register file. For an on-path 
hazard, the path taken after the rollback will be the same as the path taken prior to 
rollback and each read of r k will produce the same value as before. Branch hazards will 
be removed by the compiler transformation presented in Section 3.3. It is assumed that 
the read buffer is an integral part of the register file and any error in the system does 
not corrupt the transfer to the read buffer or its contents. 

In contrast to a history buffer which forces a read of r k prior to writing r k , the read 
buffer monitors the register file ports and stores only the values read as part of the normal 
program flow and, therefore, should not significantly impact the register file performance 
or CPU cycle time. The read buffer is twice the width of a register with a depth of 
This is twice the size of a delayed write buffer, but eliminates the requirement for 
complex bypassing and prioritization logic. 

3.4.2 Covering on-path hazards 

In addition to resolving all on-path hazards, the read buffer will resolve some branch 
hazards. Figure 3.10 shows an on-path hazard and a branch hazard both with definitions 
of x in Ii and uses of z, after rollback, in instructions Ij and respectively. Note that 
if path / is initially taken, the read buffer will contain the old value of x and rollback 
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rollback 


Figure 3.10: Covering on-path hazard. 

would be successful. However if path m is taken, the read buffer will not contain the old 
value of x and rollback would be unsuccessful. If only paths such as / exist, the presence 
of the on-path hazard assures successful rollback or “covers” the branch hazard. In this 
case, resolution of the branch hazard using compiler techniques is not necessary. 


3.4.3 Post-pass transformation 

Given the efficiency of the read buffer in resolving on-path hazards, a post-pass trans- 
formation on assembler-level code becomes possible as a replacement for the nop insertion 
transformation described in Section 3.3.7 (p. 41). The post-pass transformation creates 
on-path hazards when necessary to assure that all branch hazards are resolved by the 
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read buffer. Given one such branch hazard which defines physical register r fe at instruc- 
tion the transformation inserts an MOV r k , r k instruction immediately before This 
guarantees that all paths leading to /,• are like path l in Figure 3.10. 

3.5 Performance Evaluation 

3.5.1 Implementation 

The transformation algorithms presented in Section 3.3 have been implemented in 
the MIPS code generator of the IMPACT C compiler [32]. Transformations resolving 
pseudo register hazards (loop protection, node splitting, and loop expansion) are called 
just before register allocation. Transformations resolving machine register hazards are 
called after the live range constraints have been generated and before physical register 
allocation. The nop insertion algorithm, or post-pass algorithm, is called before the 
assembly code output routine. 

3.5.2 Application programs 

Table 3.1 lists the eleven application programs used in the evaluations. The applica- 
tions were cross-compiled on a SPARCserver 490 and run on a DECstation 3100. Size 
is the number of assembly instructions emitted by the code generator, not including the 
library routines and other fixed overhead. 

The results axe summarized in Figures 3.11 through 3.21 (pp. 50 through 55). Each 
figure contains two plots, the first plot shows the percent of run-time overhead ( Time 
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Table 3.1: Application programs: run-time and code size overhead evaluation. 


Program 

Size 

Description 

QUEEN 

148 

eight-queen program 

WC 1 

181 

UNDC utility 

QSORT 

252 

quick sort algorithm 

CMP 

262 

UNDC utility 1 

GREP 

907 

UNDC utility 

PUZZLE 

932 

simple game 


1826 

UNDC utility 

LEX 

6856 

lexical analyzer 

YACC 

8099 

parser-generator 

TBL 

8197 

table formatting preprocessor 

CCCP 

8775 

preprocessor for gnu C compiler 


OH) of the referenced hazard resolution scheme, and the second plot shows the percent 
of code growth overhead {Size OH) relative to the base values in Table 3.1. 

Four hazard resolution techniques were evaluated. Compiler l resolves on-path haz- 
ards only, using the compiler-driven data-flow manipulations presented in Chapter 3. 
Compiler 2 extends the compiler transformations to resolve both on-path and branch 
hazards. PP (post-pass) disables the compiler transformations and relies solely on the 
post-pass transformation presented in Section 3.4. Comp/PP uses compiler transforma- 
tions to resolve branch hazards, assumes a read buffer to resolve on-path hazards, and 
uses the post-pass transformation to remove remaining branch hazards. 

Due to the excessive compile times of Compiler 1 and Compiler 2 , for large appli- 
cations, the evaluations of these schemes were restricted to applications QUEEN, WC, 
COMPRESS, CMP, PUZZLE, and QSORT. The compiler transformations to resolve 
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branch hazards for Comp/PP have been enhanced to reduced compile times. These en- 
hancements are described in Chapter 4. Both Comp/PP and PP were evaluated for all 
eleven applications. 

3.5.3 Performance analysis 

Compiler transformations used for the removal of data hazards can impact perfor- 
mance in several ways. Loop protection inserts save/restore operations at the head and 
tail of the loop. This increases the path length and, therefore, the run time. Additional 
arcs in the dependency graph can cause more spill code to be generated, increasing mem- 
ory references and cache misses. Nop insertion can be costly since up to N nops could be 
inserted for each unresolved hazard. The insertion of MOV r*, r* instructions to create 
covering on-path hazards in the post-pass transformation also increases path lengths, al 
though typically less than with nop insertions. Finally, the increase in code size, mainly 
due to loop expansion, may cause more run-time cache misses. 

The loop expansion transformation can improve performance over a compiler that 
does not have this optimization technique [33] as demonstrated by the negative run-time 
overhead measurements for COMPRESS, CMP, and PUZZLE, shown in Figures 3.13, 
3.14, and 3.15 (pp. 51 and 52), respectively. Once the loop is expanded, some condition 
rh <»rlf g and index operations can be eliminated. Also the save/restore operations from 
loop protection shorten the live ranges of some registers thus allowing more efficient 
register allocation. Only the latter optimization is implemented in the current software. 
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3.5.4 Results: Compiler 2 

As can be seen in Figures 3.11 through 3.16 (pp. 50 through 52), extending the com- 
piler hazard resolution scheme to include branch hazards introduces little incremental 
performance impact or code growth overhead. Given a rollback distance of 10, resolving 
both on-path and branch hazards using compiler transformations resulted in a maxi- 
mum performance impact of 32.6% and an average performance impact of 12.6%. This 
compares with maximum and average impacts of 35.4% and 15.4%, respectively, for 
compiler-driven on-path hazard resolution only. The maximum code size overhead mea- 
sured for the extended compiler-based was 328% with an average overhead of 207%, for 
a rollback distance of 10. This compares with a maximum and average overhead of 372% 
and 225%, respectively, for the unextended compiler-based scheme. 

These results indicate a small incremental run-time performance overhead and a small 
code size overhead given compiler-based branch hazard removal compared to compiler- 
based on-path hazard removal alone. Three factors account for these small incremental 
impacts. First, on-path hazards dominate in frequency of occurrence. Second, resolving 
an on-path hazard at instruction U through renaming can sometimes resolve a branch 
hazard at instruction /<. Third, resolving on-path hazards with nop insertion may resolve 
a corresponding branch hazard by increasing the distance between the hazard node and 
its nearest predecessor branch node. 
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3.5.5 Results: PP 

Figures 3.11 through 3.21 (pp. 50 through 55) show the run-time and code size 
overheads for each application studied assuming the read buffer to resolve on-path hazards 
and the post-pass transformation described in Section 3.4 to cover all branch hazards. 
The results are worst case in that many of the branch hazards could have been resolved 
with no performance impact using the compiler techniques of Section 3.3. Instead, they 
are resolved by the insertion of MOV instructions which cause a guaranteed, although 
small, performance impact. Given a rollback distance of 10, the post-pass transformation 
produced a maximum performance impact of 7 .69% with an average performance impact 
of 2.43%, significantly below the levels produced by the compiler-based scheme. Code 
growth overhead measurements were correspondingly lower with a maximum overhead 
of 13.0% and an average overhead of 8.59%. 

3.5.6 Results: Comp/PP 

The compiler-assisted scheme achieved consistently low performance overheads across 
all applications and slightly better performance than with the post-pass transformation 
only. Given a rollback distance of 10, the compiler-assisted scheme produced a maximum 
performance impact of 6.57% with an average performance impact of 2.03%, and a max- 
imum code growth overhead of 51.2% with and an average overhead of 15.5%. The run 
time results of PUZZLE, YACC, and CCCP indicate that compiler techniques are still 
useful in reducing run-time performance penalties. These compiler techniques, however, 
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have the disadvantage of requiring recompilation and additional code growth. The pri- 
mary advan tag e of the compiler-assisted and post-pass schemes are their utilization of 
the read buffer to resolve the more frequent on-path hazards. 
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Figure 3.11: Run-time overhead and code size overhead: QUEEN. 
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Figure 3.12: Run-time overhead and code size overhead: WC. 
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Figure 3.13: Run-time overhead and code size overhead: COMPRESS. 
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Figure 3.14: Run-time overhead and code size overhead: CMP. 
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Figure 3.15: Run-time overhead and code size overhead: PUZZLE. 
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Figure 3.16: Run-time overhead and code size overhead: QSORT. 
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Figure 3.19: Run-time overhead and code size overhead: YACC. 
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Figure 3.20: Run-time overhead and code size overhead: TBL. 
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Figure 3.21: Run-time overhead and code size overhead: CCCP. 


3.6 Concluding Remarks 

A compiler-based and a compiler- assisted scheme have been described which support 
multiple instruction rollback with branch recovery. Hazard classification has proved useful 
in construction of the compiler-assisted scheme. Compiler transformations such as pseudo 
register renaming, node splitting, loop protection, and loop expansion were shown to 
be effective in resolving on-path and branch hazards with little performance impacts 
over resolving on-path hazards alone. The compiler-based approach yields performance 
impacts consistent with previous compiler techniques [29] and hardware techniques [12]. 
A hardware assisted scheme was introduced to resolve on-path hazards by maintaining a 
window of instruction read history. 

The hardware assisted scheme introduces little performance impact and reasonable 
additional circuitry. Compiler techniques are used to resolve the remaining branch haz- 
ards with a modest increase in overall compile time. The performance measurements 



indicate that the compiler- assisted scheme can achieve lower performance impact than 
either a compiler-based scheme or a delayed write hardware scheme. It should be noted 
that the combined scheme applies only to the CPU and requires additional hardware to 
maintain the states of the program counter, program status word, etc.. The read buffer 
is twice the size of a delayed write buffer but avoids the requirement for bypassing and 


prioritization logic. 



4. TRANSFORM ENHANCEMENTS 


4.1 Introduction 

In Chapter 3, a compiler-assisted multiple instruction rollback recovery scheme was 
presented. The scheme uses an operand read buffer to resolve on-path rollback hazards 
and uses compiler-driven data-flow manipulations to remove branch rollback hazards. 
This chapter presents enhancements to previously proposed compiler transformations 
used for hazard resolution [29]. These enhancements result in improved compile times 
and improved application run times. 

4.2 Node Splitting 

4.2.1 Iterative node splitting algorithm 

As discussed in Chapter 3, node splitting breaks equivalence relationships which would 
prevent pseudo register renaming, i.e., given h 0 (i,j,x) or hk(t,j, x), node splitting forces 
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Figure 4.1: Iterative node splitting algorithm. 


equiv(i,j) = 0. Figure 4.1 shows an example code sequence requiring splitting and an 
iterative node splitting algorithm. 

The code segment of Figure 4.1 contains two branch hazards. The first hazard involves 
pseudo register x and the second involves pseudo register y. When two definitions of a 
hazard variable reach a node in which the hazard variable is live, the node is split. In 
this case, node splitting to resolve the hazard variable x also resolves the hazard variable 
y . This implies that the hazard set should be recalculated after splitting takes place for 
each hazard register. Previous node splitting algorithms used this iterative algorithm to 
avoid unnecessary node splitting [29]. 
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Subgraph 



Figure 4.2: Node splitting: original subgraph. 

Figures 4.2 through 4.4 demonstrate the effect of the iterative node splitting algorithm 
on an example subgraph. Node splitting relative to hazard variable x ensures that the 
definition of x in node n i and the definition of x in node n 2 do not both reach the same 
use of x in node ns. Node splitting relative to y ensures that the definition of y in node 
n 3 and the definition of y in node n* do not both reach the same use of y in node n 6 . 

Figure 4.3 shows the subgraph after splitting relative to hazard variable x, and Figure 
4.4 shows the subgraph after splitting relative to hazard variables x and y. Although 
the iterative algorithm was initially intended to prevent excessive node splitting, this 
example demonstrates that excessive node splitting is still possible. Figure 4.5 shows an 
optimal subgraph which resolves both hazards with less splitting than produced by the 
iterative algorithm. 
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Optimally Split Subgraph 



Figure 4.5: Optimal node splitting relative to hazard variables x and y. 

4.2.2 Conflict definition 

To ensure minimal splitting, a new node splitting algorithm is developed using the 
concept of conflicting parents. Given a CFG with back edges disabled, let H no< us represent 
the set of all hazard hazard nodes 1 present in a CFG with back edges disabled. A conflict 
exists between node n’s parent nodes, p„ and Pb, if 

• me Hnodet n reaching jout{p a ) for some m, and 

• l e reaching -out (pi,) for some l ^ m, and 

• def(m) = def(l), where def(m) € liveJn(n) 

i A hazard node n is defined an the node representing /„ in G(V,E) given the existence of either 
h 0 (n,m,x) or h k (n,l,x). See Chapter 3 for notation details. 
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Any node with one or more conflicting parents must be split. Note that parent conflicts 
are not based on a single hazard variable. 

Figure 4.6 illustrates the conflict definition. Double arrows represent a hazard pair, 
where x is defined in node p h , and after rollback, is used prior to re-definition in node 
m. Single arrows represent reaching definitions and show that if variable x in node pt> 
is renamed to z, then x in node m would ultimately require renaming to z. Node n, 
of Figure 4.6, has conflicting parents p a and p*. Ensuring that node n does not have 
conflicting parents enables resolution of the hazard using variable renaming. 

4.2.3 Node splitting using graph coloring 

Given the definition of conflicting parents, the node splitting strategy for a particular 
node is to group the parents of that node such that elements within a group do not 
conflict. Each group becomes a parent node for a duplicate of the original node. For 
example, if node n has six parent nodes and these nodes can be organized into three 
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Figure 4.7: Node splitting using graph coloring; QSORT. 
nonconflicting groups, then only three total copies of n are required. Figure 4.7 shows 
node 48 from Lcode emitted by the IMPACT compiler for the QSORT application shown 
in Table 3.1 (p. 46). Node 48 has six parent nodes prior to splitting. These nodes can 
be arranged in a parent conflict graph, where each arc of the graph represents two nodes 
which conflict. Establishing groups can be achieved by finding the minimum coloring of 
the parent conflict graph, i.e., coloring the nodes such that no two nodes connected by an 
arc have the same color. For the example shown in Figure 4.7, three colors are sufficient 
to cover the parent conflict graph, resulting in the splitting of node 48 into nodes 48, 48 
and 48”. 

Determining whether a graph is fc-colorable is NP-complete in general: however, 
linear-time heuristics have been developed. Figure 4.8 shows the heuristic used for col- 
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Given parent_conf lict_graph (V, E) 


int color_graph (parent_conflict_graph) 
graph_struct parent_conf lict_graph; 

{ 

int i, j, k; 

graph struct temp_graph; 
node struct v[MAX_PARENTS] ; 


} 


temp_graph - parent_conflict_graph; 
while (temp graph !- null) { 

v[i] - mTn_degree_node { all V in temp_graph (V,E) } ; 

k - degree of(v[i]); 

delete v[ij and all edges of v[i] from temp_graph (V, E) ; 
++i; 


} 

++k; 

for (j-i; j<0; — j) 

color v [ j ] in parent_conflict_graph 
return (k) ; 


with one of k colors; 


Figure 4.8: Parent conflict graph coloring heuristic. 

oring the parent conflict graph. The heuristic is a modified version of an algorithm used 
for register allocation [28]. The algorithm selects the node with the fewest edges, records 
the node, and then removes it from the parent conflict graph. This process continues 
until the parent conflict graph is empty. If node n has the fewest edges (i.e., k edges), 
then at least k + 1 colors are required to color the graph. One color is required for node 
n and k colors are required for the nodes connected to n. Node n can be removed leaving 
a subgraph. Once the subgraph is colored with k colors, then node n can be colored with 
the remaining color. The reverse order of the node recordings can therefore be used to 
color the parent conflict graph. 
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disable backedges; 
calculate hazards; 

for all nodes n, in a topological traversal, do { 

compute reaching_in set for n from reaching_out set of n s 
parents; 

build a parent conflict graph (PCG) ; 

return to k the # of colors required to color the PCG; 
color PCG with k colors; 
delete node n; 

create k-1 duplicates of n; . 

use coloring to connect parent nodes to n and duplicates; 
if n was a hazard node, add duplicate nodes to hazard list, 
compute reaching_out set for n and duplicates; 

} 

enable backedges; 

Figure 4.9: One*pass node splitting algorithm. 


4.2.4 One-pass node splitting algorithm 

Both live-in (n) and reaching.out(n) analyses are required to identify conflicting par- 
ent nodes. A one-pass node splitting algorithm becomes possible by precalculating live-in 
and and, then, beginning with the root node, splitting in a topological traversal 

of the CFG. The one-pass node splitting algorithm is shown in Figure 4.9. A topological 
traversal ensures than when processing node n, all ancestors of n have been processed 
and no descendants of n have been processed. This latter case ensures that the presplit 
calculation of live-in(n) can be used for parent conflict identification when processing 
a given node. Unlike live-in(n), reaching-out(n) is affected by the splitting of ancestor 
nodes. Since reaching-out(n) is based solely on node n and its ancestors, reaching-out(n) 
can be calculated as node splitting proceeds. If a hazard node is split, each duplicate of 
the node must be added to the Hnodet se t. Since the root node does not have conflict- 
ing parents, a topological traversal of the CFG using the graph coloring node splitting 
technique ensures that no node in the resulting graph has conflicting parents. 
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Table 4.1: Node splitting algorithm comparisons: COMPRESS. 

• Iterative Algorithm run time = 614.0 seconds 

• One-pass Algorithm run time = 20.3 seconds 

• Speedup = 30.2 


Orig. Node Cnt. 

Iterative Alg. 

% Increase 

One-pass Alg. 

% Increase 

547 

601 

9.9 

566 

3.5 

461 

499 

8.2 

496 

7.6 

144 

147 

2.1 

147 

2.1 

181 

209 

15.5 

207 

14.4 

75 

80 

6.7 

80 

6.7 

21 

28 

33.3 

27 

28.6 

45 

79 

75.6 

48 

6.7 


Table 4.1 illustrates the improvement of the one-pass node splitting algorithm over 
the iterative algorithm for the COMPRESS application. The COMPRESS application 
was compiled on a SPARCserver 490 using the IMPACT C compiler [32] and a rollback 
distance of 10. Node count values represent pseudo ( Lcode ) instructions created by the 
IMPACT C compiler before and after splitting. Seven of the 14 COMPRESS functions 
which required splitting are listed. Algorithm run times represent the overall compile 
times given each node splitting algorithm. 

Table 4.1 shows a marginal overall code growth reduction for the one-pass algorithm. 
Although one function demonstrated a significant code growth reduction (6.7% down 
from 75.6%), the function is small and has minimal effect on the overall code size. The 
improvement in compile-time of the one-pass algorithm is more dramatic, resulting in a 
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speedup of 30.2. The compile-time improvement can be explained as follows. If 60 hazard 
variables are present in a given function, the iterative algorithm may require up to 60 
passes through the CFG of that function, including 60 data-flow analysis and hazard 
calculations. Although processing a given node in the one-pass algorithm is slightly more 
complex, a single data-flow analysis calculation and a single pass through the CFG are 

sufficient. 

4.3 Loop Protection 

As discussed in Chapter 3, due to the significant code growth potential of loop expan- 
sion and the infrequency of branch hazards traversing loop back edges, all such hazards 
are left to be resolved by the post-past transformation. Node splitting therefore be- 
comes the dominant pseudo level hazard removal transformation, eliminating the need 
for loop protection to aid in loop expansion. A new loop protection algorithm, aimed at 
maintaining loop integrity only during node splitting, is now developed. 

A similar approach to the conflict definition of Section 4.2 can be used to determine 
if a loop has to be protected. The header node of loop / will be defined as hdr.node(l). 
The requirement to split loop / due to hazard node n will be defined as split(l,n ) = 1 , 
and this occurs given the following conditions: 

• n € Hnodet n reaching Jn(hdr.node(l)) for some n, and 

• m € reaching Jn(hdr.node(l)) for some m ^ n, and 

• def(n) = def(m), where def(n) € liveJn(hdr.node(l)) 
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Figure 4.10: Loop protection from hazard variable x. 


Given the one-pass node splitting algorithm of Section 4.2 and no loop protection 
requirement for loop expansion, loop / will require protection from hazard node n if 
split (l, n) = 1. Figure 4.10 demonstrates how loop l is protected from hazard node n, 
where def(n) = x. The loop header will not be split since x & liveJn(hdr.node(l)). A 
loop protection algorithm, referred to as static loop protection, is shown in Figure 4.11, 
where outer (l) indicates the outer loop of loop /. The static loop protection algorithm is 
executed prior to node splitting. Each loop is processed twice: the first time to record the 
hazard variables for which the loop must be protected, and the second time to protect 
the loop. 


4.3.1 Dynamic loop protection 

Since the static loop protection algorithm shown in Figure 4.11 is executed prior to 
node splitting, it does not predict loop header splits that result from new hazard nodes 
created during ancestor splitting. Given the topological traversal of the one-pass node 
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disable backedges; 

calculate hazards; . . 

for all loops 1, from outer loops to inner loops, do 
for each hazard node n, in hazard_node_set , do { 
if (split (l,n) — Dr do { 

if (n is not in loop_hazard_set (outer (i; )) , 
add n to loop_hazard_set (1) ; 


} 


} 


do { 


for all loops 1, do { . . 

for all nodes n in loop_hazard_set (1) , ao t 
protect loop 1 for hazard variable def(n); 

} 

} 

enable backedges; 


Figure 4.11: Static loop protection algorithm. 


splitting algorithm, loop protection can be performed dynamically when a loop header is 


encountered. 

The set containing all of the nodes in loop l is defined as loop.nodes(l). Protection of 
loop / relative to hazard variable * can affect UveJn(n) for all n € loop.nodes(l). Every 
exit node of loop / in which x is live will have a restore node placed between it and its 
children nodes as shown in Figure 4.10. Changes to UveJn(n) for n € loopjnodes(l) 
is therefore contained to loop /. Figure 4.12 shows a simple dynamic loop protection 
algorithm which includes updating of the presplit live-in analysis result. 


The creation of restore nodes during loop protection can occasionally result in ad- 


ditional branch hazards. Static loop protection ensures that these additional hazards 
are identified and removed by the node splitting algorithm. Dynamic loop protection 
can create branch hazards that are not guaranteed to be identified and removed by the 
node splitting algorithm. The approach used for branch hazard removal therefore is 
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dyn_loop_protect (1, x) 

insert save node s ahead of node hdr_node(l) ; 

copy live in(hdr_node) into live_in (save_node) ; 

copy s into loop_nodes (1) ; 
for all exit nodes e of loop 1* do { 
insert restore node r; 
copy live_in (e) into live_in(r); 
copy r into loop nodes (1) ; 

for all nodes n in loop_nodes (1) , do { 
if (x in live_in (n ) ) , do { 
delete x from live_in (n) ; 
add t to live_in(n); 

} 

} 

return; 


Figure 4.12: Dynamic loop protection algorithm. 


to: 1) execute static loop protection, 2) use dynamic loop protection within the node 
splitting algorithm, and 3) re-execute the node splitting algorithm if previous dynamic 
loop protection resulted in additional hazards. Experimental results have shown that 
re-execution of the node splitting algorithm is rarely necessary. Two executions of the 
node splitting algorithm were sufficient to remove all required branch hazards for the 
eleven applications shown in Table 3.1 (p. 46). 


4.4 Performance Enhancements Through Profiling 

4.4.1 Post- pass transformation versus loop protection 

After hazards are removed by the compiler, some hazards remain and must be re- 
moved using the post-pass transformation. Previous post-pass transformations used nop 
insertions to increase all antidependency distances to > N [29] . Since nop insertion can be 
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Figure 4.13: Post-pass hazard removal using read insertion. 


costly to performance, previous compiler transformations removed all hazards possible, 
leaving only unresolvable hazards to be removed by the post-pass transformation. 

In Chapter 3, a new post-pass transformation was introduced in which nop insertion 
was replaced by read insertions as the primary hazard removal technique. As illustrated 
in Figure 4.13, up to two branch hazards can be removed by a single read instruction. 


The new post-pass transformation is very efficient and in some cases can resolve branch 
hazards with less performance impact than pseudo-level transformations. Figures 4.14 
and 4.15 show performance overhead comparisons between compiler-driven data-flow ma- 


nipulations and the new post-pass transformation. Comp/PP indicates that hazards are 
resolved by the compiler where possible, with the remaining hazards being resolved at 
the post-pass level. PP (post-pass) indicates that compiler transformations have been 
disabled and that all hazards are removed at the post-pass phase. Performance evalua- 
tions were obtained using the methods described in Chapter 3. The TBL application is 
a table formatting preprocessor for nroff, a text processing facility. PUZZLE is a game. 
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Figure 4.14: Run-time overhead: PUZZLE. 
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Figure 4.15: Run-time overhead: TBL. 
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For the PUZZLE application, compiler transformations produce better performance 
than the post-pass transformation alone. For the TBL application, shown in Figure 

4.15, using the post-pass transformation to remove all hazards produces slightly better 
performance than the combination of compiler and post-pass transformations. Hazard 
elimin ation via read insertion introduces a guaranteed but small performance impact due 
to the longer instruction path length. As demonstrated in Figure 4.14, pseudo register 
renaming can eliminate hazards without impacting performance when loop protection is 
infrequent. The save/restore operations of loop protection can result in more performance 
impact than read insertion when loop protection is frequent, as demonstrated in Figure 

4.15. 

Figure 4.16 illustrates the potential effect on performance given the following two 
types of hazard removal: 1) hazard removal using register renaming that results in loop 
protection, and 2) hazard removal using read insertion. If the protected loop is executed 
20 times and the hazard instruction is executed two times, loop protection would require 
the execution of 40 additional instructions, where read insertion would require the exe- 
cution of only two additional instructions. If the loop and hazard instruction execution 
frequencies were reversed, then read insertion would produce more performance impact 
than loop protection. As shown in Figure 4.16, profiling data can be used to aid in loop 
protection decisions. 
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Loop Protection Read Insertion 



Figure 4.16: Loop protection versus read insertion. 


4.4.2 Profiling effectiveness 

Profiled data was included in the pseudo-level transformations of Chapter 3. The pro- 
file data is comprised of both dynamic profile sampling and static prediction. The static 
prediction is used as a supplement for areas of the application code that are unexecuted 
during profile sampling. For static profiling, a loop is assumed to iterate ten times. Inner 
loops, therefore, iterate multiples of 10 times depending on the depth of loop nesting. 
All loop header nodes and hazard nodes are assigned weights based on the profile data. 

Protection of loop l due to hazard node n h is required based on the following con- 
dition: if n h -weighi > 3 * (hdr.nodt{l). weight), then protect loop /. The constant 3 
adjusts the weights to account for both direct and indirect loop protection costs. Direct 
loop protection costs result from the save/restore instruction pair shown in Figure 4.16. 
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Figure 4.17: TBL: profile data used for loop protection decisions. 

Indirect loop protection costs result from: 1) an increased number of hazards which in 
turn required more node splitting and more loop protection, and 2) increased register 
usage due to the save/restore instructions which can result in additional register spills. 
Figure 4.17 shows the run-time overhead for the TBL application with rollback distances 
from 1 to 10. Prof/PP indicates that profiling data was used in loop protection decisions. 

The results show that the use of profile data can improve application performance by 
postponing some hazard resolutions until the post-pass phase. Using profile data to aid 
in loop protection decisions did not produce performance equal to that for the post-pass 
transformation, for the TBL application. As an extension to this work, profile data can 
be used to aid in register allocation. As discussed in Chapter 3, hazards that are present 
after pseudo register renaming are resolved by adding arcs to the register allocation 
dependency graph. These additional constraints can cause additional register pillage and 
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impact performance. Similar techniques to those developed for loop protection can be 
used to enhance register allocation decisions. 

4.5 Summary 

In this chapter, compiler transformations used for the removal of branch hazards have 
been enhanced, resulting in reduced compile times and increased application performance. 
A one-pass node splitting algorithm was developed which uses the concept of conflicting 
parents to reduce the number of duplicate nodes required. A graph coloring heuristic 
was developed to connect split nodes to parents. For the COMPRESS application, the 
one-pass node splitting algorithm resulted in marginally reduced code growth and a 
compile-time speedup of 30 over previous iterative node splitting algorithms. Similar 
techniques used in the node splitting algorithm were used to develop a one-pass static 
loop protection algorithm. Due to the splitting of hazard nodes, it was shown that the 
static loop protection algorithm did not predict all loop header splitting. A dynamic 
loop protection algorithm was developed which allows loops to be protected as they 
are encountered by the node spitting algorithm. It was also shown that read insertion 
used in the post-pass hazard removal phase could produce less performance impact than 
pseudo register renaming when the latter results in loop protection. Profiling was shown 
to be effective in making better loop protection decisions, resulting in improved overall 
application performance. 



5. READ BUFFER SIZE REQUIREMENT 


5.1 Introduction 

In Chapter 3, a compiler-assisted approach to multiple instruction rollback in which 
a read buffer of size 2N (where N represents the maximum instruction rollback distance) 
was used to aid in hazard removal. This chapter examines the size and design of the 
read buffer. A practical lower bound and average size requirement for the read buffer 
are established by modifying the design to save only the data required for rollback. The 
study measures the effect on the performance of ten application programs using six read 
buffer configurations with varying read buffer sizes. Two alternative configurations are 
shown to be the most efficient and differed depending on whether split-cycle-saves are 


assumed. 
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Figure 5.1: Read buffer of size 2 N . 


5.2 Read Buffer Configurations 

Given a read buffer configuration as shown in Figure 5.1, rollback is accomplished by 
first flushing the read buffer back to the general purpose register GPRF in the reverse 
order of which the values were saved. Figure 5.1 shows the two FIFO read buffers above 
the source 1 (SI) and source 2 (S2) buses to better illustrate the buffer’s content given 
the instruction sequence shown. As long as the depth of the dual FIFO read buffers are 
N, redundant copies of the appropriate register values (denoted value(r x )) are available 

to restore the register file given a rollback of < N. 

The read buffer size requirement of 2 N is the worst case. The buffer maintains the 
last N register reads from the GPRF, assuring data redundancy for all values required. 
The read buffer may also save data which is not required during rollback. Register reads 
that must be saved can be determined at compile time. If this information is added to 
the instruction encoding (e.g., as an extra bit field for source 1 and for source 2), then the 





79 



Figure 5.2: Read buffer of size < 2 N . 

read buffer can be designed to save only those values required. As long as the required 
values are maintained for N cycles, a less than 2 N read buffer size design is possible. 

Figure 5.2 illustrates a case in which all register reads do not have to be placed in the 
read buffer. The registers required to be saved are marked with an Since only the 
required values are saved, the read buffer total size can now potentially be less than N. 
In this case, however, the instruction count must also be saved so that the value can be 
maintained for at least N cycles. In the event that the read buffer overflows, the oldest 
value in the buffer must be pushed to memory and a record kept so that during rollback 
the value can be retrieved from memory. Given a dual FIFO depth of M, memory would 
serve the function of the remaining N - M of the two FIFOs. This read buffer design 
reduces the buffer size while introducing potential performance impacts due to buffer 


overflows. 
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A key to the evaluation of a given read buffer design is the set of assumptions made 
relative to overflow handling. For example, if a memory store buffer were assumed, there 
would be no stall if a single FIFO overflowed and the store buffer was available, given that 
the current instruction were not a store. However, if the store buffer were full or if the 
current instruction were a store, then a stall would occur. The problem with including 
a store buffer in the model is that the performance impact measured would depend on 
the store buffer size, clouding the performance impact due to the read buffer alone. The 
same difficulty arises if a cache is included in the model. 

It is assumed in this evaluation that a read buffer overflow will always cause a stall 
of one cycle. If both FIFOs overflow, a stall of two cycles will occur. This simplifying 
assumption is pessimistic relative to a store buffer which may have empty locations, while 
optimistic relative to a full store buffer requiring a write to cache. These assumptions 
guarantee that all measured performance impacts are directly due to changes in the read 
buffer size or configuration. 

5.2.1 Read buffer designs 

The most straightforward design for the read buffer is that of configuration Al, shown 
in Figure 5.3. The obvious problem with configuration Al is that if the FIFO connected 
to SI is full and the current Si value must be saved, a stall occurs due to overflow even 
though the FIFO connected to S2 may have an available entry. Configuration A2 in 
Figure 5.3 resolves this inefficiency by giving both SI and S2 access to either FIFO. 
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Configuration A1 Configuration A2 Configuration B1 



o Configuration Bl: Can store buses SI and S2 simultaneously, 
o Configuration B2: Must stall on second store to single buffer, 
o Configurations C & D: Assumes stall on second store to single buffer. 


Figure 5.3: Read buffer configurations. 
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Configuration B1 also resolves the inefficiency of configuration A1 by having a single 
FIFO with both Si and S2 connected to it. Configuration B1 assumes that the SI value 
and the S2 value can be saved within the same cycle. This would be possible if the SI 
value is saved during the first half of the cycle and the S2 value is saved during the second 
h alf 0 f the cycle. This split-cycle-save assumption is consistent with the design of register 
files which write back during the first half of the cycle and read during the second half 
of the cycle [34], 

Configuration B2 is identical to configuration B1 except that two saves during the 
same cycle are not permitted. If two saves are required during the same cycle (e.g., an 
instruction of the form: r x = r* + r"), then a stall to save the second value occurs. 

Configuration C attempts to lessen the impact due to the bottleneck in configuration 
B2 by adding two single level queues between the SI and S2 buses and the single FIFO. 
Configuration C can absorb one simultaneous save, processing the first in the current cycle 
and the second in the next cycle assuming the next instruction does not also require a 
simultaneous save. Configuration D extends configuration C to allow both Si and S2 
access to either queue. 

5.3 Application Program Execution and Read Buffer Simulation 
5.3.1 Simulation approach 

The read buffer is simulated at the instruction level. Prior to each instruction execu- 
tion, a procedure is called to update the read buffer model. Parameters such as which 


cl' 
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register reads to save and instruction type are passed to the simulation program. The 
drawbacks to this approach are the code growth in the original application program and 
the reduction in application run time. 

The instructions inserted to branch to the simulation procedure prior to each original 
application instruction cannot be added in the high level language. If this were done, 
the one-to-one correspondence between original instructions and simulation procedure 
calls would be lost. Also the simulation procedure cannot be permitted to affect the 
original application by changing register assignments, live ranges, etc. For this reason, 
calculation of hazards and subsequent determination of which register reads should be 
saved are performed at the s-code level (after register assignment) and the appropriate s- 
code level instructions inserted prior to each original s-code instruction of the application 
program. 

5.3.2 Implementation 

To minimiz e the application code growth, a simple s-code sequence (written for the 
MIPS 2000/3000 architecture) shown in Figure 5.4 is inserted prior to each instruction. 
The code segment pushes register 31 and register 4 on the stack, loads register 4 with in- 
formation relative to the saving of SI or S2 for this particular instruction, calls rbu/Ssave, 
and then upon return from rbu/Ssave restores registers 31 and 4. Register 31 is used 
as a return address during procedure calls and therefore will be corrupted. Register 4 is 
used to pass parameters in the MIPS compiler convention. 
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subu $sp, 28 

sw $31, 20 ($sp) 


sw $4, 24 ($sp) 

li $4, 1 j directs read buffer to 

jal rbuf2_save save source 1 value 

lw $31, 20 ($sp) 

lw $4, 24 ($sp) 

addu $sp, 28 

# End instrument segment . 



original instruction 


0 


Figure 5.4: Instrumentation code segment. 
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* Begin rbuf2_save procedure 
.verst amp 2 10 
.extern _iob 60 
.extern __pctype 4 

.extern _ctype 0 

.text 
.align 2 

.file 2 "rbuf2_save.c" 
.globl rbuf2_save 
. loc 2 10 
.ent rbuf2_save 2 
rbuf 2_save : 

.option 01 

subu $sp, $sp, 160 

sw $31, 16($sp) 

sw $30, 20 ($sp) 


sw 


$ 2 , 


132 ($sp) 


.mask 0x8ffffff, -4 
.frame $sp, 160, $31 
.loc 2 11 

.livereg 0x8ffffff , Oxf ff 


jal rbuf2_sim 


• 

• 

• 

• 

• 

.loc 

2 12 


• 

• 

lw 

$31, 

16 ($sp) 

• 

• 

■ 

a 

Iw 

$30, 

20 ($sp) 

• 

• 


• 


• 

• 


• 


• 

• 

• 


• 


• 

■ 

• 

lw 

$2, 

132 ($sp) 

■ 

• 

• 

addu 

$sp. 

$sp, 1 

• 

• 

• 

j 

$31 


* 

a 

• 

• 

• 

.end 

rbuf2_ 

save 


C-level read buffer 
simulation program 


Figure 5.5: rbuf2save code segment. 


The code sequence of Figure 5.4 only saves the two registers necessary to branch to a 
procedure. Prior to calling the read buffer simulation procedure, the remaining registers 
which are used must be saved. This was not done in the code segment of Figure 5.4 
to limit application code growth. The code sequence, rbuf2save, shown in Figure 5.5 
conservatively saves all remaining registers on the stack. Both callee and caller saved 
registers are saved since the standard conventions are corrupted by the code insertion. 
The read buffer simulation procedure, rbuf2sim , is called from the code segment shown 
in Figure 5.5. rbuf2sim cam now be modified and re-compiled without a corresponding 
modification to the application program or the two previous s-code segments. 
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Similar s-code segments handle initialization and summary calculations. The initial- 
ization procedure call is placed in the “mam” module prior to the first instruction. The 
summary procedure calls are placed prior to all a jol exif instructions in all modules and 
prior to the $31” instructions in the u main” module. Performance impact (% increase) 


is computed as 


100 * 


stalLcydes 

base-cycles 


Stall cycles result from read buffer overflows. All instructions are assumed to require 
one cycle to complete in a pipelined architecture. This is a pessimistic assumption for 
performance impact measurement since load and branch delays would give the read buffer 
an extra cycle to handle an overflow. The assumption is again made to help isolate read 
buffer effects on performance from those of various delay slot filling strategies. 

The hazard analysis transformation operates on the s-code emitted by the MIPS code 
generator of the IMPACT C compiler [32]. The transformation determines which register 
reads should be saved by the read buffer and inserts calls to the initialization, simulation, 


and summary procedures as described earlier. The resulting s-code modules are then 
compiled and run on a DECstation 3100. For the study, a rollback distance of 10 was 
selected. Given a rollback distance of 10, a read buffer size of 20 (for configurations Al, 


A2, and Bl) will produce zero performance impact. Table 5.1 lists the ten application 
programs studied. Size is the number of assembly instructions emitted by the code 
generator, not including the library routines and other fixed overhead. 
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Table 5.1: Application programs: read buffer size study. 



Description 

QUEEN 


eight-queen program 

WC 


UNIX utility 

QSORT 1 


quick sort algorithm 

CMP 


UNIX utility 

GREP 


UNIX utility 

PUZZLE 


simple game 

COMPRESS 


UNIX utility 

LEX 

idsa 

lexical analyzer 

YACC 

mm 

parser-generator 

CCCP 1 

| 8775 

preprocessor for gnu C compiler 


5.4 Results and Analysis 
5.4.1 Detailed analysis: QUEEN 

Figure 5.6 shows changes in performance overhead (Cycles OH) for various read buffer 
sizes and configurations running the QUEEN application. Looking at Figure 5.6, con- 
figuration Al, it can be seen that significant performance impact is incurred even with 
a modest reduction in read buffer size. As can be seen from the other application runs, 
shown in Figures 5.8 through 5.16 (pp. 93 through 97), configuration Al is consistently 
the least efficient of the six configurations studied. 1 This is due to the fact that the dual 
FIFO’s are dedicated to a single source bus. In many cases saving SI will cause an over- 
flow because the SI FIFO is full, even though there is room in the S2 FIFO. Configuration 
Al does allow for simultaneous saves of Si and S2, given sufficient room in each, but this 
feature does not compensate for the latter inefficiency. Configuration A2 demonstrates 


x An efficient configuration is one with a low performance overhead given a small read buffer size 
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Figure 5.6: Cycle overhead: QUEEN. 

the improvement gained by allowing either source bus access to either FIFO. Configura- 
tion B1 was the most efficient of the six configurations for the QUEEN application. In 
this configuration a total read buffer size of 13 would produce zero performance impact 
with a 35% reduction in read buffer size. 

Comparing configurations A2 and B1 of Figure 5.6, it can be seen that configuration 
A2 out-performs configuration B1 at the smaller buffer sizes while B1 performs slightly 
better at the larger buffer sizes. Figure 5.7 illustrates the operation of read buffer con- 
figurations A2 and B1 given an example instruction sequence. Instruction operands that 
require saving are marked with an For the instruction sequence shown, a read buffer 
size of two, and a maximum rollback distance of four, it can be seen that configuration 
B1 results in one extra overflow. In configuration A2, value(n) is loaded into the S2 
buffer and remains there during subsequent loads of the Si buffer. After instruction / 4 , 
va lue{r b ) becomes invalid since a rollback of four instructions would be to I 2 . A similar 
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Instruction Sequence 
It 


r « * r « + r b 


h:- r c = r 

ly r t = r 
! 4 : r= r 


+ r. 


+ r. 


+ r. 


Configuration A2 


51 

52 







value(r o ) valuer.) | 


V r i - r . + r j 


r and r. become invalid 
(ma ximu m rollback = 4) 


overflow value(r 4 ) 
overflow value(r c ) 
overflow value(r a ) 

extra overflow 


Configuration B1 

51 

52 


5 


value(r i ) 


valuefr • ) 


T 


overflow value(r e ) 
overflow yaluefr) 
overfl ow valuefr^) 
overflow vaiuefr^) 


Figure 5.7: Read buffer configurations A2 and Bl: buffer size — 2. 
scenario is possible with configurations Al. Due to the arrangement of configuration Bl, 
value(r b ) overflows before it becomes invalid, resulting in one extra overflow. As the 
buffer size increases, values in the Bl buffer have more time to become invalid before 
they reach the head position; extra overflows become less frequent. 

Configuration Bl can also produce less overflows than configuration A2. When two 
operands require saving in the buffer of configuration Bl, the head position and the 
head - 1 position of the Bl buffer are checked to see if those positions contain valid 
data. When two operands require saving in the buffers of configuration A2, the head 
positions of the SI buffer and S2 buffer are checked. Configuration A2 can require and 
extra overflow if one of its two head positions have valid data and one of the head - 1 
positions has invalid data. The advantage of configuration Bl (relative to configuration 
A2) becomes visible at large read buffer sizes where the previously mention disadvantage 
of configuration Bl (relative to configuration A2) diminishes. 
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This characteristic of configuration A2 versus configuration B1 is present m most of 
the application results. It should be noted that configuration B1 assumes that simulta- 
neous saves of SI and S2 can be handled within the same cycle. If this latter assumption 
is invalid, Figure 5.6, configuration B2, shows that no less than 9.4% performance im- 
pact is achieved regardless of the read buffer size. The “leveling off" of B2 is due to the 
bottleneck at the single FIFO entry point and not the depth of the FIFO. The flat part 
of the curve shows the percent of instructions requiring simultaneous saves of SI and S2 

in the QUEEN application. 

Figure 5.6, configuration C, shows how a single level dual queue placed between the 
source bus and the single FIFO can alleviate some of the bottleneck effects. The dual 
queue can absorb a single simultaneous save of SI and S2, distributing the saves over 
multiple cycles. A nonzero minimum performance overhead is still present due to cases 
in which the dual queue has not emptied before the next simultaneous save occurs. 

Figure 5.6, configuration D, shows the results of an improved queue structure which 
permits saves from either bus into either queue. This configuration avoids stalls in some 
cases (e.g., S2 must be saved while the queue dedicated to S2 in configuration C is full and 
the other queue is empty). Configuration D also has a nonzero minimum performance 
overhead but gives better performance than configuration C. 

The simulation results for QUEEN show that configuration A1 is the least efficient 
and that given the ability to do split-cycle-saves, configuration B1 is the most efficient. 
Without the split-cycle-save capability, configuration P is the best of the single FIFO 
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Table 5.2: Result Summary. 



RB.size 

OH-level (%) 

Program 

A2 Bl 

A2 Bl ^ 

QUEEN 

14 

12 

1.66 

1.36 

WC 

10 

8 

0.00 

2.54 

QSORT 

16 

15 

2.28 

0.94 

CMP 

12 

11 

0.00 

0.00 

GREP 

10 

10 

0.18 

0.18 

PUZZLE 

10 

9 

2.87 

0.32 

COMPRESS 

12 

12 

2.87 

1.12 

LEX 

12 

12 

2.73 

1.55 

YACC 1 

16 

15 

1.07 

0.00 

CCCP 

12 

12 

2.34 

1.74 


designs resulting in a minimum performance overhead of 4.5%, and configuration A2 is 
the best of the dual FIFO designs resulting in a 1.7% performance overhead with a read 
buffer size of 14. For configurations Bl, B2, C, and D, a total read buffer size of 13 is 
sufficient to maximize performance. 2 


5.4.2 Evaluation of all application programs 

Results for the other nine application programs are similar to those for QUEEN and 
can be found in Figures 5.8 through 5.16 (pp. 93 through 97). The differences between 
the application results are the points at which the curve “levels off” (i.e., the buffer size) 
and, in the case of configurations B2 through D, at what level the performance overhead 
stabilizes. Table 5.2 summarizes measurements obtained for the ten applications given 
the two most efficient configurations, A2 and Bl. It is assumed for this study that 


2 Two must be added to each read buffer size value in C and D to account for the queues. 
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minimal performance overhead can be tolerated as a result of read buffer size reduction. 
For this reason, configuration comparisons are made at read buffer size values which 
produce low values of performance overhead. Configuration A2 does not level off like 
configuration D and does not rapidly approach zero like configuration Bl. For a better 
comparison of configurations A2 and Bl, Table 5.2 gives the read buffer size value where 
the performance overhead value drops below 3%. The read buffer size value is referred 
to as RBsize and the performance overhead value is referred to as OHJevel. 

It can be seen from Table 5.2 that the read buffer size requirement is roughly the 
same, per application, regardless of the split-cycle-save assumption (i.e., comparing con- 
figurations A2 and Bl). The size requirement is application dependent - from 8 for WC, 
to 15 for QSORT and YACC. The measurements show that a considerable reduction 
in read buffer size is achievable. Given the split-cycle-save assumption and configura- 
tion Bl, a minimum of 25%, a maximum of 60%, and an average of 42% reduction was 
achieved. For configuration A2 and no split-cycle-save assumption, a minimum of 20%, a 
maximum of 50%, and an average of 38.0% reduction was achieved. The measurements 
indicate that care should be taken relative to the ultimate selection of read buffer size. 
Given the steepness of the Bl curve around the RBsize value, small decreases in size 
can produce large performance overheads. 

As seen in these results, the full 2 N read buffer size is not required for full on-path 
hazard resolution and negligible performance overhead given a wide variety of applica- 
tion programs. Slightly smaller read buffer sizes are possible given the split-cycle-save 
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capability. As indicated by our measurements, placing a single level queue between the 
source buses and a single FIFO (configurations C and D) was not as effective as a dual 
FIFO where each source bus has access to each FIFO (configuration A2). When the split- 
cycle-save capability was not assumed and a single FIFO was used, QUEEN, QSORT, 
COMPRESS, LEX, YACC, and CCCP showed moderate performance overheads regard- 
less of buffer size. 
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Figure 5.8: Cycle overhead: WC. 
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Figure 5.9: Cycle overhead: QSORT. 
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Figure 5.10: Cycle overhead: CMP. 
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Figure 5.11: Cycle overhead: GREP. 
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Figure 5.12: Cycle overhead: PUZZLE. 
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Figure 5.15: Cycle overhead: YACC. 



Figure 5.16: Cycle overhead: CCCP. 
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5.5 Summary 

By adding extra bits to the operand field for source 1 and source 2, it becomes possible 
to design the read buffer proposed in Chapter 3 to save only those values required, 
thus reducing the read buffer size requirement. The performance cost of the buffer size 
reduction is occasional read buffer overflows which result in stall cycles. Results show 
that two read buffer configurations were the most efficient. The dual FIFO with source 
bus access to each and the single FIFO with the split-cycle-save capability consistently 
out-performed the other configurations. There were moderate variances between the 
buffer sizes required for minimum performance impact between the ten applications and 
the performance stabilization value assuming no split-cycle-save capability. Up to a 55% 
read buffer size reduction is achievable with an average reduction of 39.5% given the 
most efficient read buffer configuration for the applications. It was also found that given 
the split-cycle-save assumption and single FIFO configuration, significant changes in the 
performance overhead result from small changes in the read buffer size. This indicates 
that care should be taken in the final selection of read buffer size in any given design. 
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6. MIR TECHNIQUES APPLIED TO SPECULATIVE EXECUTION REPAIR 


6.1 Objectives 

Speculative execution is an effective method to increase instruction level parallelism 
which can be exploited by both super-scalar and VLIW architectures. The key to a 
successful general speculation strategy is a repair mechanism to handle mispredicted 
branches and accurate reporting of exceptions for speculated instructions. Speculative 
execution repair (SER) strategies have been proposed which trade-off speculation scope, 
hardware complexity, and software complexity. Many of the difficulties encountered 
during recovery from branch misprediction, or from instruction re-execution due to ex- 
ceptions, are similar to those encountered during multiple instruction rollback (MIR). 
This chapter investigates the applicability of compiler-assisted instruction rollback to aid 


in SER. 
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6.2 Introduction 

Super-scalar and VLIW architectures have been shown effective in exploiting instruc- 
tion level parallelism (ILP) present in a given application [32, 35, 36]. Creating additional 
ILP in applications has been the subject of much study in recent years [37-39]. Code 
motion within a basic block is insufficient to unlock the full potential of super-scalar 
and VLIW processors with issue rates greater than two [32]. Given a trace of the most 
frequently executed basic blocks, limited code movement across block boundaries can 
create additional ILP at the expense of requiring complex compensation code to ensure 
program correctness [40]. Combining multiple basic blocks into superblocks permits code 
movement within the superblock without the compensation code required in standard 
trace scheduling [32]. 

General upward and downward code movement across trace entry points (joins) and 
general downward code motion across trace exit points (branches, or forks) is permitted 
without the need for special hardware support [40]. Sophisticated hardware support is 
required, however, for upward code motion across a branch boundary. Such code motion 
is referred to as speculative execution and has been shown to substantially enhance per- 
formance over nonspeculated architectures [41-43]. This chapter focuses on the support 
hardware for speculative execution, which is responsible to ensure correct operation in 
the presence of excepting speculated instructions and mispredicted branches. 
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Tj in live_out of taken path speculated instruction traps 


Figure 6.1: Speculative execution. 

6.3 Speculative Execution 

Figure 6.1 illustrates the two basic problems which are encountered when attempting 
upward code motion across a branch. First, if the speculated instruction (i.e., an instruc- 
tion moved upward past one or more branches) modifies the system state, and due to 
the branch outcome the speculated instruction should not have been executed, program 
correctness could be affected. Second, if the speculated instruction causes an exception, 
and again due to the branch outcome, the excepting instruction should not have been 
executed, program performance or even program correctness could be affected. 

6.3.1 Branch repair 

Figure 6.2 shows an original instruction schedule and a new schedule after speculation. 
Instructions d, i, and / have been speculated above branches c and g from their respective 
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Figure 6.2: Branch repair. 



fall-through paths. 1 Speculated instructions are marked “(s). B The motivation for such 
a schedule might be to hide the load delay of the speculated instructions or to allow more 
time for the operands of the branch instructions to become available. If c commits to the 
taken path (i.e., it is mispredicted by the static scheduler), some changes to the system 
state that have resulted from the execution of d, i, and /, may have to be undone. No 
update is required for the PC; execution simply begins at j. If instead, c commits to the 
fall-through path but g commits to the taken path, then only t’s changes to the system 
state may have to be undone. 

Not all changes to the system state are equally important. If for example, d writes to 
register r x and r x £ UveJn(j), then the original value of r x does not have to be restored. 
Inconsistencies to the system state as a result of mispredicted branches exhibit similarities 

i For this example it is assumed that the fall-through paths are the most likely outcome of the branch 
decisions at c and g. 
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to branch hazards in multiple instruction rollback. The two differences in how branch 
hazards are determined for speculative execution are: 1) the walk to record variable 

i 

assignments, described in Properties 1 and 2 of Section 3.2, begins at the immediate 
predecessor of the branch in question (I b ) and proceeds in a backwards progression, i.e., 
u ... Wlb-N, and 2) only speculated instructions are considered in the walk. 

The walk distance N for speculative execution is the maximum distance from ^A-i to 
(*)/ n along any backwards walk, where J n was speculated above h- A branch hazard 
h b (d,j) exists in Figure 6.2 if d writes to register r, and j reads r,. 

Given the similarity between branch hazards due to instruction rollback and branch 
hazards due to speculative execution, compiler-driven data-flow manipulations, similar 
to those presented in Chapter 3, can be used to resolve branch hazards that result from 
speculation. Such compiler transformations have been proposed for branch misprediction 
handling [42]. Since re-execution of speculated instructions is not required for branch 
misprediction, compiler resolution of branch hazards becomes a sufficient branch repair 
technique. 

6.3.2 Exception repair 

Figure 6.2 also demonstrates the handling of speculated trapping instructions. If 
d is a trapping instruction and an exception occurred during its execution, handling 
of the exception must be delayed until c commits so that changes to the system state 
are minimized, and in some cases to ensure that repair is possible in the event that 
c is mispredicted. If c commits to the taken path, the exception is ignored and d is 
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handled like any other speculated instruction given a branch mispredict. If c was correctly 
predicted, three exception repair strategies are possible. The first is to undo the effects 
of only those instructions speculated above c (i.e., d, i, and /) and then branch to a 
recovery block RB.c [43] as shown in Figure 6.2. The address of the recovery block can 
be be obtained by vising the PC value as an index into a hash table. This strategy ensures 
precise interrupts [14, 44] relative to the nonspeculated schedule but not relative to the 
original schedule. Recovery blocks can cause significant code growth [43]. The second 
strategy undoes the effects of all instructions subsequent to d (i.e., *, b, and /), handles 
the exception and resumes execution at instruction t [42]. This latter strategy provides 
restartable states and does not require recovery blocks. A third exception repair strategy 
undoes the effects of only those subsequent instructions that are speculated above c (i.e., 
only » and /), handles the exception, and resumes execution at instruction t, however, this 
time only executing speculated instructions until c is reached. The improved efficiency 
of strategy 3 over that of strategy 2 comes at the cost of slightly more complex exception 
repair hardware. 

When a branch commits and is mispredicted, the exception repair hardware must 
perform three functions: 1) determine whether an exception has occurred during the 
execution of a speculated instruction, 2) if an exception has occurred, determine the 
PC value of the excepting instruction, and 3) determine which changes to the system 
state must be undone. Functions 1 and 2 are similar to error detection and location in 
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Figure 6.3: Exception repair. 

multiple instruction rollback. Function 3 is similar to on-path hazard resolution in mul- 
tiple instruction rollback. As discussed in Chapter 3, on-path hazards assume that after 
rollback, the initial instruction sequence from the faulty instruction to the instruction, 
where the error was detected, is repeated. 

Figure 6.3 illustrates the speculation of a group of instructions and re-execution strat- 
egy 3. The load instruction traps, but the exception is not handled until the branch 
instruction commits to the fall-through path. Control is then returned to the trapping 
instruction. This scenario is identical to multiple instruction rollback where an error oc- 
curs during the load instruction and is detected during the branch instruction. For this 
example, only n must be restored during rollback since r 4 and r 5 will be rewritten prior 
to use during re-execution. Figure 6.3 shows that exception repair hazards in speculative 
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execution are the same as on-path hazards in multiple instruction rollback, and a read 
buffer as described in Chapter 3 can be used to resolve these hazards. The depth of the 
read buffer is the maximum distance from I b to /„ along any backwards walk, where I n 
is a trapping instruction and was speculated above branch instruction h- 

6.3.3 Schedule reconstruction 

Assumed in Figures 6.2 and 6.3 are mechanisms to: 1) identify speculative instruc- 
tions, 2) determine the PC value of excepting speculated instructions, and 3) determine 
how many branches a given instruction has been speculated above. An example of the 
latter case is shown in Figure 6.2 where instructions d, t, and / , are undone if c is 
mispredicted; however, only i must be undone if g is mispredicted. 

If the hardware had access to the original code schedule, the design of these mecha- 
nisms would be straightforward. Unfortunately, static scheduling reorders instructions at 
compile-time and information as to the original code schedule is lost. To enable recovery 
from mis predicted branches and proper handling of speculated exceptions, some infor- 
mation relative to the original instruction order must be present in the compiler-emitted 
instructions. This will be referred to as schedule reconstruction. 

By limiting the flexibility of the scheduler, less information about the original schedule 
is required. For example, if speculation is limited to one level only (i.e., above a single 
branch), a single bit in the opcode field is sufficient to indicate that the instruction has 
been moved above the next branch [41]. The hardware would then know exactly which 
instruction effects to undo (i.e., the ones with this bit set). Also, removing branch hazards 
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directly with compiler-driven data-flow manipulations permits general speculation with 
no schedule reconstruction for branch repair [42]. 

6.4 Implicit Index Schedule Reconstruction 

Implicit index scheduling supports general speculation of regular and trapping in- 
structions. The scheme was inspired by the handling of stores in the sentinel scheduling 
scheme [42] and was designed to exploit the unique properties of the read buffer hardware 
design described in Chapter 3. Schedule reconstruction is accomplished by marking each 
instruction speculated or nonspeculated and using this marking to maintain an operand 
history of speculated instructions in a FIFO queue called a speculation read buffer (SRB). 
The SRB operates similar to a read buffer with some additional provisions for exception 

handling. 

6.4.1 Exception repair using the speculation read buffer 

Figure 6.4 shows an original code schedule and two speculative schedules, along with 
the contents of the SRB at the time branches I c and I, commit. Instructions I d and 
Ij have been speculated above branch instruction I c , and J, has been speculated above 
both I g and I c . Speculated instructions are marked. This marking identifies the source 
operands to be saved in the SRB. Along with the source operand values and corresponding 
register addresses, the PC of the speculated instruction is also recorded in the SRB. 

Speculated instructions execute normally unless they trap. If a speculated instruction 
traps, the exception bit in the SRB which corresponds to the trapping instruction is set 
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and program execution continues. Subsequent instructions that use the result of the 
trapping instruction are allowed to execute normally. 

A chk.except(k) instruction is placed in the home block of each speculated instruction. 
Only one chk- except (k) instruction is required for a home block. As the name implies, 
chk.except(k) checks for pending exceptions. The command can simultaneously interro- 
gate each location in the SRB by utilizing the bit field k. As shown in schedule 1 of 
Figure 6.4, chk-except(OOllll) in /' checks for exceptions for instructions Id and Id- If a 
checked exception bit is set, the SRB is flushed in reverse order, restoring the appropriate 
register and PC values. Execution can then begin with the excepting instruction. 

Figure 6.4 illustrates several on-path hazards which are resolved by the SRB. In 
schedule 1, if U traps and the branch I e commits to the taken path, /, has corrupted r 2 
and // has corrupted r 7 . Flushing the SRB up through /, restores both registers to their 
values prior to the initial execution of /,. Note that register r 6 is also corrupted but not 
restored by the SRB, since after rollback r 6 will be rewritten with a correct value before 
the corrupted value is used- 

Instead of checking for exceptions in each home block, the exception could be handled 
when the exception bit reaches the bottom of the SRB. This is very similar to the re- 
order b uff er used in dynamic scheduling [14]. This eliminates the cost of the chk-except(k) 
command, however, and increases the exception handling latency which can impact per- 
formance depending on the frequency of exceptions. In addition, the technique guarantees 
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that exceptions will be processed in the original home block order. For example, in sched- 
ule 2 of Figure 6.4, if both I, and 7/ trap, ckk-except(k) ensures that 7/ will be handled 
first. 

Implicit index scheduling derives its name from the ability of the compiler to locate a 
particular register value within the SRB. This is possible only if the dynamically occurring 
history of speculated instructions is deterministic at branch boundaries. Superblocks 
guarantee this by ensuring that the sole entry into the superblock is at the header and 
by limiting speculation to within the superblock. For standard blocks, bookkeeping code 
[40] can be used to ensure this deterministic behavior. 

6.4.2 Branch repair using the speculation read buffer 

Branch repair can be handled by removing branch hazards with the compiler. As 
shown in Chapter 3, branch hazard removal in multiple instruction rollback can be as- 
sisted by the read buffer when u covering” on-path hazards are present, reducing the per- 
formance cost of variable renaming. In a similar fashion, the SRB can assist in branch 
repair. Figure 6.5 shows the original code schedule and the two speculative schedules of 
Figure 6.4. For this example, it is assumed that r 2 , r 3 , r 6 , and r 7 are elements in both 
liveJn(Ij) and liveJn(Ik)- 

As shown in schedule 1, if branch instruction 7 C commits to the taken path, r 2 , r 6 , 
and r 7 , which were modified in 7j, /<*, and //, respectively, must be restored. If instead, 
I c commits to the fall-through path and I g commits to the taken path, only r 2 must 
be restored. Registers r 2 and r 7 are rollback hazards that result from exception repair; 
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Figure 6.5: Branch repair using a speculation read buffer (SRB). 
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therefore, the SRB contains their unmodified values. By including a flush (k) command at 
the target of I c and I g , the SRB can be used to restore r 2 and/or r 7 given a misprediction 

of I c or Ig. 

The flush (k) command selectively flushes the appropriate register values given a 
branch misprediction. For example, in schedule 2 of Figure 6.5, if I e is predicted cor- 
rectly and I, is mispredicted, the SRB is flushed in reverse order up through restoring 
value(r 3 ) from but not restoring value{r 7 ) from //. Since speculation is always from 
the most probable branch path, the flushflt) command is always placed on the most 
improbable branch path, minimizing the performance penalty. Not all branch hazards 
are resolved by the presence of on-path hazards. These remaining hazards can either 
be resolved with compiler transformations or by inserting MOV r 0 , r x instructions as 
described in Section 3.3. It would be necessary to mark the MOV instruction speculated 
to ensure that v x is loaded into the SRB. 

6.5 SRB Flush Penalty 

The examples of Section 6.4 demonstrate that the compiler-assisted multiple instruc- 
tion rollback scheme presented in Chapter 3 can be applied to both branch repair and 
exception repair in a speculative execution architecture. The flush penalty of the read 
buffer is not a key concern in multiple instruction rollback applications since instruction 
faults are typically very rare. In application to exception repair in speculative execution, 
the SRB flush penalty is also not a major concern due to the infrequency of exceptions 
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involving speculated instructions. However, in application to branch repair, the SRB 
flush penalty could produce significant performance impacts. Studies of branch behav- 
ior show a conditional branch frequency of 11% to 17% [34]. Static branch prediction 
methods result in branch mispredictions in the range of 5% to 15%. This results in a 
branch repair frequency as high as 2.5%. Assuming a CPI (clock cycles per instruction) 
rate of one and an average SRB flush penalty of ten cycles, the performance overhead 
of the flush mechanism would reach 22.5%. This indicates the importance of minimizing 
the amount of redundant data stored in the SRB so that the flush penalty is reduced. 

In Chapter 5, several read buffer configurations were studied. A technique was pro- 
posed to reduce the amount of redundant data in the read buffer so that the read buffer 
size could be reduced. A similar technique can be used to assure that only the data 
required for branch and exception repair is stored in the SRB. In the implicit index 
scheme of Section 6.4, a bit indicating whether an instruction is speculated is added to 
the opcode field. By expanded this field to two bits, operand storage requirements can be 
specified. Figure 6.6 shows the reduced contents of the SRB given schedule 1 of Figure 
6.5. In the modified scheme, only the first read of rj must be maintained. Register rg 
is not required since it was not modified. The improved scheme also eliminates blank 
spaces in the SRB. For this example, the misprediction of I e in schedule 1 of Figure 6.5 
results in four less variables to flush. 

The coding of the two speculation bits would be as follows: 00) no save required, 
01) save operand 1, 10) save operand 2, and 11) save both operands. If neither operand 
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SRB Contents 

Figure 6.6: SRB with reduced content. 

of a speculated instruction has be saved in the SRB, the instruction is not marked as 
speculated. This is not a problem for branch repair: however, if such an instruction traps, 
the hardware would have no way of knowing not to handle the exception immediately. 
There would also be no entry in the SRB for the exception bit or for the corresponding 
PC value. One solution to the problem would be to add another bit to the opcode field 
which marks speculated trapping instructions. A better solution is to code all speculated 
trapping instructions as 01. This will indicate that exception handling is to be delayed 
and cause a reservation of an entry in the SRB. This latter approach will slightly increase 
the flush penalty during branch repairs. Separate SRBs could be maintained for branch 
and exception repairs. 
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Figure 6.7: Instrumentation code placement. 


6.6 Performance Evaluation 
6.6.1 Evaluation methodology 

In this section, the read buffer flush penalty is evaluated using a similar strategy to 
the one presented in Chapter 5. The instrumentation code segments of Figure 6.7 call a 
branch error procedure which performs the following functions: 


1. Update the read buffer model. 

2. Force actual branch errors during program execution, allowing execution to proceed 
along an incorrect path for a controlled number of instructions. 

3. Terminate execution along the incorrect path and restore the required system state 
from the simulated read buffer. 


4. Measure the resulting flush cycles during the branch repair. 

5. Begin execution along the correct path until the next branch is encountered. 
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An example instrumentation code segment is shown in Figure 6.8. Parameters, such 
as operand saving information, current PC, branch faU-though PC, and branch target 
PC values, are passed by the instrumentation code to the branch error procedure. An 
additional miscellaneous parameter contains instruction type and information used for 

debugging. 

Figure 6.9 gives a high level flow of operation for branch error procedure. When a 
branch instruction in the original application program is encountered, an arm-branch flag 
is set. Prior to the execution of the next application instruction, the arm-branch flag is 
checked, and if set, the branch decision made by the application program is set aside. 
The branch is then predicted by the branch prediction model. Four models are used in 
the evaluation: 1) predict taken, 2) predict not taken, 3) dynamic prediction, and 4) 
static prediction from profiling information. The dynamic prediction model is derived 
from a two bit counter branch target buffer (BTB) design [45] and is the only model that 

requires updating with each prediction outcome. 

After the branch is predicted, the prediction is checked against the actual branch 
path taken by the application program. If the prediction was correct execution proceeds 
normally. If the prediction was incorrect, the correct branch path is loaded into the 
recovery queue along with a branch error detection (BED) latency, and the predicted path 
is loaded into the PC. The BED latency indicates how long the execution of instructions is 
to continue along the incorrect path. The branch error time-out flag is set when the BED 
latency is reached. When a branch error is detected, the register file state is repaired by 
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$ simlb_2_24_0 : 

T instruction 24 

* Begin brsim sim hook: si - 16, s2 


subu ?sp, 
la $at, 
sw $at, 
la $et, 
sw $at, 
la $at , 
sw $at, 
li $at, 
sw $at, 
li $at, 
sw $at, 
move $at, 

# End brsim_sim hook. 
$ simlb 2 24 1: 


44 

$ simlb_2_24_0 
2TT($sp) 

$_simlb_2_24_l 
24 ($sp) 

$ simlb_2_2 5_0 
2~S ($sp) 

8216 

32 ($,sp) 

16 — 


brsim save 


40 ($sp) 
$sp 


addu 


$16, 


$16, 


$_simlb_2_25_0 : 

# instruction 25 

# Begin brsim sim hook: si - 16, s2 - 9: 


subu 
la 
sw 
la 
sw 
la 
sw 
li 
sw 
la 
sw 
li 
sw 

move 

j 

# End brsim_sim hook 
$ simlb 2 25 1: 


?sp, 
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$at, 
$at , 
$at , 
$at, 
$at, 
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$at , 
$at , 
$at , 
$at , 
$at , 


44 

$ simlb_2_25_0 
275 ($sp) 

$_simlb_2_25_l 
24 ($sp) 

$ main_6 < 
2l($sp) 

532505 

32 ($sp) 

$ main_5 ■« 
3^($sp) 

304 


brsim save 


40 ($sp) 
$sp 


bne 


$16, 


$9, 


$ main 5 


normal 

- hook address 

— instruction adress 

- next hook address 

— miscellaneous 


directs read buffer to save 
register 16 


original instruction 


branch 

— hook address 

— instruction adress 

— next hook address 

— miscellaneous 

— target address 


directs read buffer to save 
registers 16 and 9 


original instruction 


$ main 6: 


Figure 6.8: Instrumentation code sequences. 
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PC - pro gr a m counter 

GPRF - general purpose register file 

RB - read buffer 

BPM - branch prediction model 


Figure 6.9: Branch error procedure operation. 
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the read buffer. The PC value of the correct branch path is obtained from the recovery 
queue. The number of cycles required to flush the read buffer during branch repair is 

also recorded. 

It is assumed for this evaluation that two read buffer entries can be flushed in a single 
cycle. This corresponds to the split-cycle-save assumption of Chapter 5. Performance 
overhead due to read buffer flushes (% increase) is computed as 


FlushJOH = 100 * 


flu3h.cycles 

totalucycles 


All instructions are assumed to require one cycle for execution. This is conservative 
since the MIPS processor used for the evaluation requires two cycles for a load. The 
additional cycles would increase the total jcycles and thereby reduce the observed perfor- 
mance overhead. In addition to accurately measuring flush costs, the evaluation verifies 
the operation of the read buffer and its ability to restore the appropriate system state 
over a wide range of applications. 

The instrumentation insertion transformation operates on the s-code emitted by the 
MIPS code generator of the IMPACT C compiler [32). The transformation determines 
which operands require saving in the read buffer and inserts calls to the initialization , 
branch error , and summary procedures. Initialization and summary calculations are 
handled as in Chapter 5. The resulting s-code modules are then compiled and run on a 
DECstation 3100. For the evaluation, BED latencies from 1 to 10 were used. Table 5.1 
(p. 87) lists the ten application programs evaluated. 
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6.6.2 Evaluation results 

Experimental measurements of read buffer flush overhead (Flush OH) for various 
BED latencies are shown in Figures 6.10 through 6.14 (pp. 122 through 124). The four 
branch prediction strategies used for the evaluation are referred to as: 1) predict taken 
(P.Taken), 2) predict not taken ( PJf.Taken ), 3) dynamic prediction based on a branch 
target buffer ( Dyn-Pred), and 4) static branch prediction using profiling data ( Profited ). 

As expected, flush costs were closely related to branch prediction accuracies, i.e., the 
more often a branch was mispredicted, the more often flush costs were incurred. In a 
speculative execution architecture, branch prediction inaccuracies result in performance 
impacts in addition to the impacts from the branch repair scheme. Branch mispredic- 
tion increases the base run time of an application by permitting speculative execution 
of unproductive instructions. Increased levels of speculation increase the performance 
impacts associated with branch prediction inaccuracies. Only the performance impacts 
associated read buffer flushes are shown in Figures 6.10 through 6.14. 

For nine of the ten applications, P-N.Taken was significantly more accurate or marginally 
more accurate in predicting branch outcomes than P -Taken. For QSORT, P- Taken was 
significantly more accurate than P-N- Taken. This result demonstrates that in a spec- 
ulative execution architecture, it is difficult to guarantee optimal performance across a 
range of applications given a choice between predict-taken and predict-not-taken branch 
prediction strategies. 
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For all but one application, Prof-Pred was more accurate than either P -Taken or 
p_N_ Taken. For CMP, Prof-Pred, P-N-Taken, and Dyn-Pred were nearly perfect m their 
prediction of branch outcomes. Prof-Pred marginally outperformed Dyn-Pred in all ap- 
plications except LEX, 

The purpose of measuring read buffer flush costs given the recovery from injected 
branch errors is to establish the viability of using a read buffer design for branch repair 
for speculative execution. Although in such a speculative schedule only static prediction 
strategies would be applicable, the Dyn-Pred model was included to better assess how 
varying branch prediction strategies impact flush costs. Overall, the accuracy of Dyn-Pred 

fell between P- Taken/ P-N- Taken and Prof-Pred. 

Over the ten applications studied, read buffer flush overhead ranged from 49.91% for 
the P -Taken strategy in CCCP to .01% for the P-N-Taken strategy for CMP given a 
BED of ten. It can be seen from Figures 6.10 through 6.14 that a good branch prediction 
strategy is key to a low read buffer flush cost. The results show that given a static branch 
prediction strategy using profiling data, an average BED of ten produces flush costs 
no greater than 14.81% and an average flush cost of 8.12% across the ten applications 
studied. This performance overhead is comparable to the overhead expected from a 
delayed write buffer scheme with a maximum allowable BED of ten [12]. However, given 
a maximum BED of ten and an average BED of less than ten, the flush costs of the read 
buffer would be less than that of a delayed write buffer, because a delayed write buffer 
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is designed for a worst-case BED and the flush penalty of a read buffer is based on the 
average BED. 

The BED for a given branch in this evaluation corresponds to the number of instruc- 
tions moved above a branch in a speculative schedule. The results of the evaluation 
indicate that if the average number of instructions speculated above a given branch is < 
10, then the read buffer becomes a viable approach to handling branch repair. 


Flush OH 
(%) 



Flush OH 
<*) 


50H 


40 -I 


30H 


P_Taken: -©- 

P J^_Taken:-o- 
Dyn_Pred: 

Prof Pred: -a- 



1 — i — i — i — i — i — i — i — r 
23456789 10 
BED Latency 


Figure 6.10: Flush penalty: QUEEN, WC. 




i 23456789 10 
BED Latency 


123456789 10 
BED Latency 


Figure 6.12: Flush penalty: PUZZLE, QSORT. 



i 2 3 4 5 6 7 8 9 10 
BED Latency 


i 23456789 10 
BED Latency 


Figure 6.14: Flush penalty: YACC, CCCP. 
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6.7 Summary 

Speculative execution has been shown to be an effective method to create addi- 
tional instruction level parallelism in general applications. Speculating instructions above 
branches requires schemes to handle mispredicted branches and speculated instructions 

that trap. 

This chapter showed that branch hazards resulting from branch mispredictions are 
similar to branch hazards in multiple instruction rollback recovery. It was shown that 
compiler techniques similar to those presented in Chapter 3 can be used as an effective 
branch repair scheme in a speculative execution architecture. It was also shown that 
data hazards as a result of rollback due to exception repair are similar to on-path hazards 
described in Chapter 3, indicating that a read buffer approach to exception repair was 

viable. 

Implicit index scheduling was introduced to exploit the unique characteristics of roll- 
back recovery using a read buffer approach. The read buffer design was expanded to 
include PC values to aid in rollback from excepting speculated instructions. Similar to 
“covering” on- path hazards discussed in Chapter 3, the read buffer was shown to resolve 
some branch hazards without the need for compiler transformations. 

Read buffer flush penalties were measured by injecting branch errors into ten target 
applications and measuring the flush cycles required to recover from the branch errors 
using a simulated read buffer. It was shown that with a static branch prediction strategy 


using profiling data, flush costs under 15% are achievable. The results of these evalua- 
tions indicate that the compiler-assisted multiple instruction rollback scheme presented 
in Chapter 3 is viable in application to branch and exception repair in a speculative 


execution architecture. 
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7. CONCLUSIONS 


7.1 Summary 

This thesis has presented a compiler-assisted multiple instruction rollback scheme 
which combines compiler-driven data-flow manipulations with dedicated data redundancy 
hardware to remove all data hazards that result from multiple instruction rollback. Ex- 
perimental evaluation of the proposed compiler-assisted scheme with a maximum rollback 
distance of ten showed performance impacts of no more than 6.57% and an average im- 
pact of 1.80%, over the ten application programs studied. The performance evaluation 
indicates lower performance penalties than for previous compiler-only approaches or com- 
parable hardware-only approaches. Compiler transformations used for hazard removal 
have been enhanced reducing application code growth and compile times, and in some 
cases improving application execution performance. Ten read buffer configurations were 
studied to determine the minimum size requirement for general applications. It was 
found that a 55% read buffer size reduction is achievable with an average reduction of 
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39.5%, but that additional control logic to handle read buffer overflows may limit the 
overall hardware savings. It was also shown that the proposed compiler-assisted multi- 
ple instruction rollback technique can be applied to speculative execution repair. The 
problems associated with recovery from mispredicted branches and excepting speculated 
instructions were shown to be similar to problems encountered with multiple instruc- 
tion rollback recovery. A speculative scheduling scheme was proposed which utilizes 
compiler-driven hazard removal transformations and the read buffer to aid in hazard 
removal during exception handling and mispredicted branch handling. 

7.2 Limitations 

The compiler-assisted rollback recovery scheme presented limits system state space 
restoration to the register file. Other methods, such as history buffers, would be required 
to maintain an N cycle history of the program counter and program status word. Cache 
memory and main memory would require an N cycle rollback capability which could 
be implemented with an N depth delayed write buffer. Functional units such as the 
floating point unit and I/O units would require a rollback capability or a capability 
to be flushed and restarted. A spontaneous change to the contents of the register file 
is not recoverable by the compiler-assisted recovery scheme, although the propagation 
of such errors is recoverable if the errors are detected within N cycles. For enhanced 
fault tolerance, error detection/correction codes could be used in the register file. A 
similar limitation exists for cache and main memories.. Unlike the previously mentioned 
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limitations, the requirement that an error does not cause an illegal path in the control- 
flow graph of the program is unique to compiler-assisted rollback recovery. For enhanced 
fault tolerance, a control-flow error detection mechanism with a latency no greater than 
one cycle would be required. Finally, the compiler-assisted rollback recovery scheme 
requires recompilation of application programs and libraries. 

7.3 Future Research 

The use of profile data can be extended to the register allocation phase and is expected 
to result in further reduction in performance overhead. Application of compiler- assisted 
multiple instruction rollback recovery to super-scalar, VLIW, and parallel processing ar- 
chitectures is an area with great potential. Given the flexibility of the IMPACT compiler 
platform used for current hazard removal transformations, studies of rollback recovery 
schemes for the three architectures are feasible and should produce near-term results. 
Further evaluations of compiler-assisted rollback recovery applied to speculative execu- 
tion repair would include modifying compiler transformations to operate in a super-scalar 
and VLIW environment. Again, the flexibility of the IMPACT compiler platform should 
simplify this investigation. An additional extension would be to develop and evaluate 
a scheme which handles both instruction rollback recovery and speculative execution 
repair. 
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